Practical 1: Overview of the Sklearn Library#

The Scikit-Learn Library#

The sklearn library is one of the most widely used today, both in academia and in industry. What makes it so popular is that it is written in Python (a language widely used in industry), has an active developer community, and is open source.

Below we describe the library's main functionality, as well as the main algorithms it implements.

Most Commonly Used Class Methods#

  • fit(X, [y]): Builds the model from the training set. For unsupervised algorithms, y can be omitted.

  • predict(X): Predicts the class or regression value.

  • predict_proba(X): Predicts the class probabilities of the input samples X.

  • score(X, y[, sample_weight]): Returns the score for the given dataset:

    • For classification, this is accuracy

    • For regression, this is R²

  • set_params(**params): Sets the algorithm's hyperparameters (see the sketch after this list).

  • get_params([deep]): Returns the algorithm's hyperparameters.
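As an illustration of set_params and get_params (a minimal sketch; the keyword names must match the keys returned by get_params):

from sklearn.linear_model import LogisticRegression

lrc = LogisticRegression()
lrc.set_params(C=0.5, max_iter=500)  # keywords must match keys from get_params()
print(lrc.get_params()["C"])  # 0.5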

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import make_moons, make_blobs, make_regression, make_classification
from sklearn.model_selection import train_test_split

np.set_printoptions(suppress=True)
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
from sklearn.linear_model import LogisticRegression

lrc = LogisticRegression(random_state=42)
model = lrc.fit(X_train, y_train)

preds = model.predict(X_test)
predsp = model.predict_proba(X_test)
scores = model.score(X_test, y_test)
print("Model: ", lrc)
print("Model hps: ", lrc.get_params())
print("Model Score: ", scores)
print("Model prediction: \n", preds[:10])
print("Model prediction (proba): \n", predsp[:10])
Model:  LogisticRegression(random_state=42)
Model hps:  {'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': 42, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
Model Score:  0.94
Model prediction: 
 [1 2 2 0 0 0 1 2 2 2]
Model prediction (proba): 
 [[0.00010604 0.99988101 0.00001294]
 [0.00681164 0.00000441 0.99318395]
 [0.00124705 0.00063736 0.99811559]
 [0.9981574  0.00003683 0.00180576]
 [0.99644246 0.00000272 0.00355483]
 [0.99873481 0.00000168 0.00126351]
 [0.00007422 0.99987744 0.00004834]
 [0.00242916 0.00004639 0.99752445]
 [0.39413349 0.0837987  0.52206781]
 [0.00635346 0.00354697 0.99009957]]
  • transform: Applies a transformation to the data

  • fit_transform: Fits and then applies the transformation to the input data

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)
X_test_ = scaler.transform(X_test)
X_test_[:10]
array([[ 1.08678365, -1.5164436 ],
       [-1.55876784,  0.65593415],
       [-1.31288369, -0.37880668],
       [ 0.24177302,  1.46105908],
       [-0.09670804,  1.81924233],
       [ 0.05017578,  1.9255206 ],
       [ 0.81429729, -1.62586094],
       [-1.46497424,  0.12512164],
       [-0.23033232, -0.0316933 ],
       [-1.01038457, -0.33584257]])
X_train_ = scaler.fit_transform(X_train)
X_train_[:10]
array([[ 0.29038368,  1.12950646],
       [-1.13999693,  0.59901388],
       [ 0.01306258,  1.97048737],
       [ 1.50049455, -2.55336023],
       [ 0.37695627, -1.46288631],
       [ 0.49285469, -1.82403643],
       [ 0.6848445 , -0.01627551],
       [-1.62026329, -0.38417894],
       [ 0.97952963,  0.19779602],
       [ 1.8370938 , -0.70870969]])
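Note that the scaler statistics are learned only from the training data: the test set is transformed with the training mean and standard deviation, which avoids data leakage. fit_transform is simply a shortcut for fit followed by transform on the same data.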

Classification Algorithms#

X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=0)

df = pd.DataFrame(
    {
        'feature-1': X[:, 0],
        'feature-2': X[:, 1],
        'target': y
    }
)

features = ["feature-1", "feature-2"]
target = "target"

sns.scatterplot(data=df, x=features[0], y=features[1], hue=target)
plt.show()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
[Figure: scatter plot of the three blobs, colored by class]

Logistic Regression (LogisticRegression)#

from sklearn.linear_model import LogisticRegression

lrc = LogisticRegression(random_state=42)
model = lrc.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)

print(pred[:10])
print(score)
[1 2 2 0 0 0 1 2 2 2]
0.94

k-Nearest Neighbors (KNeighborsClassifier)#

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
model = knn.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)

print(pred[:10])
print(score)
[1 2 2 0 0 0 1 2 0 2]
0.9266666666666666
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)

Decision Tree (DecisionTreeClassifier)#

from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(random_state=42)
model = dtc.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)

print(pred[:10])
print(score)
[1 2 2 0 0 0 1 2 0 2]
0.9266666666666666

Naive Bayes (GaussianNB)#

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
model = gnb.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)

print(pred[:10])
print(score)
[1 2 2 0 0 0 1 2 2 2]
0.9333333333333333

Random Forest (RandomForestClassifier)#

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=42)
model = rfc.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)

print(pred[:10])
print(score)
[1 2 2 0 0 0 1 2 0 2]
0.94

Support Vector Machine (SVC)#

from sklearn.svm import SVC

svc = SVC(random_state=42)
model = svc.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)

print(pred[:10])
print(score)
[1 2 2 0 0 0 1 2 2 2]
0.9333333333333333

Multi-layer Perceptron classifier (MLPClassifier)#

from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(random_state=42)
model = mlp.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)

print(pred[:10])
print(score)
[1 2 2 0 0 0 1 2 0 2]
0.92
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:692: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(

Regression Algorithms#

X, y = make_regression(
    n_samples=500,
    n_features=10,
    n_informative=8,
    noise=30,
    random_state=1
    )

df = pd.DataFrame(
    {
        'feature-1': X[:, 0],
        'target': y
    }
)

features = ["feature-1", "feature-2"]
target = "target"

sns.scatterplot(data=df, x=features[0], y=target)
plt.show()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
[Figure: scatter plot of feature-1 vs. target]

Lasso Regression (Lasso)#

from sklearn.linear_model import Lasso

las = Lasso(random_state=42)
model = las.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)

print(pred[:10])
print(score)
[-115.7708748   -46.31684795  -20.86665902 -158.62497355  382.25990628
 -124.28940267   68.21063266 -164.95106788 -328.33404498  222.81955835]
0.9469180813122322

Ridge Regression (Ridge)#

from sklearn.linear_model import Ridge

rid = Ridge(random_state=42)
model = rid.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)

print(pred[:10])
print(score)
[-122.78992165  -46.12780147  -21.94305706 -162.50410571  387.85141669
 -125.93351061   71.01896076 -167.59242227 -334.32584704  225.75003396]
0.9468979075721192

ElasticNet Regression (ElasticNet)#

from sklearn.linear_model import ElasticNet

ent = ElasticNet(random_state=42)
model = ent.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)

print(pred[:10])
print(score)
[ -71.19710668  -25.93729254  -15.55881416 -101.25000773  253.68386461
  -78.39175189   42.01147833 -104.95807672 -219.87137993  160.11027644]
0.8339397652780292

k-Nearest Neighbors (KNeighborsRegressor)#

from sklearn.neighbors import KNeighborsRegressor

knr = KNeighborsRegressor()
model = knr.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)

print(pred[:10])
print(score)
[ -99.6684154   -16.38105864  -50.66401373  -57.41325167  239.25981724
 -128.72638105   83.64068093 -146.05781319 -255.81035766  152.25888883]
0.7061793685321537

Decision Tree (DecisionTreeRegressor)#

from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor(random_state=42)
model = dtr.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)

print(pred[:10])
print(score)
[  55.72824398  -34.50897194  -28.79172463 -145.85559682  271.30919082
  -70.81129949  171.02917036 -145.85559682 -325.1987638   274.33407544]
0.5107145300278733

Random Forest (RandomForestRegressor)#

from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(random_state=42)
model = rfr.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)

print(pred[:10])
print(score)
[  24.1702344   -42.91186327  -17.1736669  -139.23683542  218.26798239
 -118.29363653   82.02520146 -138.82174793 -263.34837058  231.63570598]
0.8015615355305776

Support Vector Machine (SVR)#

from sklearn.svm import SVR

svr = SVR()
model = svr.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)

print(pred[:10])
print(score)
[  0.44496551  -2.50843853   1.67095038  -3.91897052  16.13692994
  -7.41046195   6.61101378  -9.16400223 -11.47253449  19.69234491]
0.10988234690190346
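The low R² here is expected: SVR with the default RBF kernel is quite sensitive to feature scaling and to the C and epsilon hyperparameters, and here it was applied to unscaled data with default settings.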

Multi-layer Perceptron (MLPRegressor)#

from sklearn.neural_network import MLPRegressor

mlpr = MLPRegressor(random_state=42)
model = mlpr.fit(X_train, y_train)
pred = model.predict(X_test)
score = model.score(X_test, y_test)

print(pred[:10])
print(score)
[-68.09169775  -8.98875781 -19.65923616 -40.95727195 158.65427625
 -30.366409    39.17966508 -46.90013887 -99.71012589 106.20396741]
0.5630287207632332
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:692: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(

Clustering Algorithms#

X, _ = make_blobs(n_samples=500, n_features=2, centers=3, random_state=1)
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

df = pd.DataFrame(
    {
        'feature-1': X[:, 0],
        'feature-2': X[:, 1],
    }
)

features = ["feature-1", "feature-2"]
target = "target"

sns.scatterplot(data=df, x=features[0], y=features[1])
plt.show()
[Figure: scatter plot of the min-max-normalized blob data]

K-Means (KMeans)#

from sklearn.cluster import KMeans

kmeans = KMeans(random_state=42)  # n_clusters defaults to 8, hence labels 0-7 below
y_ = kmeans.fit_predict(X)

print(y_[:10])
[1 4 7 3 2 3 5 2 5 6]

Agglomerative Clustering (AgglomerativeClustering)#

from sklearn.cluster import AgglomerativeClustering

agc = AgglomerativeClustering()
y_ = agc.fit_predict(X)

print(y_[:10])
[0 0 0 1 0 1 0 0 0 0]

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)#

from sklearn.cluster import DBSCAN

dbs = DBSCAN()
y_ = dbs.fit_predict(X)

print(y_[:10])
[0 0 0 0 0 0 0 0 0 0]
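With the default eps and min_samples, DBSCAN tends to place most of these dense, min-max-normalized points in a single cluster; in practice these hyperparameters usually need to be tuned to the data's density.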

Preprocessing & Dimensionality Reduction Techniques#

There is a wide range of preprocessing and dimensionality reduction techniques in sklearn; more information about them can be found in the scikit-learn user guide.

Below we highlight the most common ones, each with a code usage example:

Imputation of Missing Values (SimpleImputer)#

X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=0)

X[:25, 0] = np.nan 
X[25:50, 1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
list(map(lambda x: np.isnan(x).sum(), [X_train, X_test, y_train, y_test]))
[32, 18, 0, 0]
from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")
imp_mean.fit(X_train)

X_train_ = imp_mean.transform(X_train)
X_test_ = imp_mean.transform(X_test)

print("Train nan [count]: ", np.isnan(X_train).sum())
print("Train nan [count]: ", np.isnan(X_test).sum())

print("Train transformed nan [count]: ", np.isnan(X_train_).sum())
print("Test transformed nan [count]: ", np.isnan(X_test_).sum())
Train nan [count]:  32
Test nan [count]:  18
Train transformed nan [count]:  0
Test transformed nan [count]:  0

Scaling Features to a Given Range (MinMaxScaler)#

X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
list(map(lambda x: (np.min(x), np.max(x)), [X_train, X_test, y_train, y_test]))
[(-4.10970064619185, 6.560510824746599),
 (-3.3511606691597353, 6.213852280547424),
 (0, 2),
 (0, 2)]
from sklearn.preprocessing import MinMaxScaler

mm = MinMaxScaler()
mm.fit(X_train)

X_train_ = mm.transform(X_train)
X_test_ = mm.transform(X_test)

print("X_train min/max", (np.min(X_train_), np.max(X_train_)))
print("X_test min/max", (np.min(X_test_), np.max(X_test_)))
X_train min/max (0.0, 1.0)
X_test min/max (0.04484147438192271, 0.960195322826444)
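The test set minimum and maximum are not exactly 0 and 1 because the scaler was fit only on the training data: the test samples are mapped using the training minimum and maximum.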

Feature Selection (SelectKBest)#

X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
list(map(lambda x: x.shape, [X_train, X_test, y_train, y_test]))
[(350, 10), (150, 10), (350,), (150,)]
from sklearn.feature_selection import SelectKBest

skb = SelectKBest(k=5)
skb.fit(X_train, y_train)

X_train_ = skb.transform(X_train)
X_test_ = skb.transform(X_test)

print("X_train shape: ", X_train_.shape)
print("X_test shape: ", X_test_.shape)
X_train shape:  (350, 5)
X_test shape:  (150, 5)

Principal Component Analysis (PCA)#

X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
list(map(lambda x: x.shape, [X_train, X_test, y_train, y_test]))
[(350, 10), (150, 10), (350,), (150,)]
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
ss.fit(X_train)

X_train_ = ss.transform(X_train)
X_test_ = ss.transform(X_test)

pca = PCA(n_components=5, random_state=0)
pca.fit(X_train_)

X_train_pca = pca.transform(X_train_)
X_test_pca = pca.transform(X_test_)

print("X_train shape: ", X_train_pca.shape)
print("X_test shape: ", X_test_pca.shape)
X_train shape:  (350, 5)
X_test shape:  (150, 5)

Creating Pipelines#

Pipelines are especially useful when we want to apply several techniques before modeling. They also help us avoid writing long stretches of repetitive code.

A Pipeline sequentially applies a list of transformations followed by a final estimator. For the intermediate steps, it calls fit and transform automatically.

Below we illustrate how the pipeline works.

Without the Pipeline#

We want to perform the following tasks:

  1. Imputation of missing values

  2. Data standardization (z-score)

  3. Dimensionality reduction via PCA to 5 features

  4. Modeling with Logistic Regression

X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)

X[:25, 0] = np.nan 
X[25:50, 1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")
imp_mean.fit(X_train)

X_train_imp = imp_mean.transform(X_train)
X_test_imp = imp_mean.transform(X_test)

ss = StandardScaler()
ss.fit(X_train_imp)

X_train_ss = ss.transform(X_train_imp)
X_test_ss = ss.transform(X_test_imp)

pca = PCA(n_components=5, random_state=0)
pca.fit(X_train_ss)

X_train_pca = pca.transform(X_train_ss)
X_test_pca = pca.transform(X_test_ss)

lr = LogisticRegression(random_state=42)
model = lr.fit(X_train_pca, y_train)
preds = model.predict(X_test_pca)
predsp = model.predict_proba(X_test_pca)
scores = model.score(X_test_pca, y_test)

print("Model: ", lr)
print("Model hps: ", lr.get_params())
print("Model Score: ", scores)
print("Model prediction: \n", preds[:10])
print("Model prediction (proba): \n", predsp[:10])
Model:  LogisticRegression(random_state=42)
Model hps:  {'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': 42, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
Model Score:  0.7733333333333333
Model prediction: 
 [1 1 0 0 1 0 0 1 1 0]
Model prediction (proba): 
 [[0.27950121 0.72049879]
 [0.42193184 0.57806816]
 [0.96734767 0.03265233]
 [0.58694956 0.41305044]
 [0.40789539 0.59210461]
 [0.98215815 0.01784185]
 [0.51741084 0.48258916]
 [0.36402651 0.63597349]
 [0.25303639 0.74696361]
 [0.87368411 0.12631589]]

With the Pipeline#

We want to perform the same tasks as before:

  1. Imputation of missing values

  2. Data standardization (z-score)

  3. Dimensionality reduction via PCA to 5 features

  4. Modeling with Logistic Regression

from sklearn.pipeline import Pipeline

steps = [
         ('si', SimpleImputer(missing_values=np.nan, strategy="mean")),
         ('ss', StandardScaler()),
         ('pca', PCA(n_components=5, random_state=42)),
         ('lrc', LogisticRegression(random_state=42))
        ]

pipe = Pipeline(steps)
pipe.fit(X_train, y_train)

preds = pipe.predict(X_test)
predsp = pipe.predict_proba(X_test)
scores = pipe.score(X_test, y_test)

print("Model: ", pipe)
print("Model Score: ", scores)
print("Model prediction: \n", preds[:10])
print("Model prediction (proba): \n", predsp[:10])
Model:  Pipeline(steps=[('si', SimpleImputer()), ('ss', StandardScaler()),
                ('pca', PCA(n_components=5, random_state=42)),
                ('lrc', LogisticRegression(random_state=42))])
Model Score:  0.7733333333333333
Model prediction: 
 [1 1 0 0 1 0 0 1 1 0]
Model prediction (proba): 
 [[0.27950121 0.72049879]
 [0.42193184 0.57806816]
 [0.96734767 0.03265233]
 [0.58694956 0.41305044]
 [0.40789539 0.59210461]
 [0.98215815 0.01784185]
 [0.51741084 0.48258916]
 [0.36402651 0.63597349]
 [0.25303639 0.74696361]
 [0.87368411 0.12631589]]

Model Selection#

Metrics and scoring: quantifying the quality of predictions (Evaluation)#

from sklearn.metrics import balanced_accuracy_score

y_true = [0, 1, 0, 0, 1, 0]
y_pred = [0, 1, 0, 0, 0, 1]

balanced_accuracy_score(y_true, y_pred)
0.625
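Balanced accuracy is the average of the recall obtained on each class. Here, class 0 has recall \(3/4\) (three of its four samples are predicted correctly) and class 1 has recall \(1/2\), so the balanced accuracy is \((3/4 + 1/2)/2 = 0.625\).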

Cross-validation: Evaluating Model Performance (CV)#

Train/Test Split#

from sklearn.model_selection import train_test_split
from sklearn import datasets

X, y = datasets.load_iris(return_X_y=True)

[i.shape for i in [X, y]]
[(150, 4), (150,)]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

[i.shape for i in [X_train, X_test, y_train, y_test]]
[(105, 4), (45, 4), (105,), (45,)]

Cross Validation#

from sklearn.model_selection import cross_val_score

clf = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(clf, X_train, y_train, cv=10, scoring="balanced_accuracy")
scores
array([0.91666667, 1.        , 1.        , 0.75      , 0.83333333,
       1.        , 1.        , 0.80555556, 1.        , 0.91666667])
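The fold scores are typically summarized as mean ± standard deviation (for the array above, the mean is roughly 0.9222):

print(f"bac: {scores.mean():.4f} +- {scores.std():.4f}")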

Tuning the hyper-parameters of an estimator (HPT)#

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons

X, y = make_moons()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)  # hold out a test set for the tuned model

tune_parameters = {
    "criterion": ["gini", "entropy"],
    "max_depth": np.linspace(1, 32, 32, endpoint=True),
    "min_samples_leaf": np.linspace(0.01, 0.1, 5, endpoint=True)
    }
clf = GridSearchCV(
    estimator=DecisionTreeClassifier(),
    param_grid=tune_parameters,
    scoring="balanced_accuracy",
    cv=10,
    refit=True
)

clf.fit(X_train, y_train)
GridSearchCV(cv=10, estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13.,
       14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25., 26.,
       27., 28., 29., 30., 31., 32.]),
                         'min_samples_leaf': array([0.01  , 0.0325, 0.055 , 0.0775, 0.1   ])},
             scoring='balanced_accuracy')
clf.score(X_test, y_test)
1.0
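Since the grid search was created with refit=True, the best configuration was refit on the whole training set and can be inspected through standard GridSearchCV attributes:

print(clf.best_params_)  # best hyperparameter combination found
print(clf.best_score_)   # its mean cross-validated balanced accuracy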

Saving and Loading Models#

A scikit-learn model can be saved using pickle, Python's protocol for serializing and deserializing objects:

Saving a Model#

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

lr = LogisticRegression()
model = lr.fit(X_train, y_train)
import pickle

model_name = "my_model.pkl"
with open(model_name, 'wb') as file: 
  pickle.dump(model, file)
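For estimators that internally store large numpy arrays, joblib is often suggested as a more efficient alternative to pickle (a minimal sketch, assuming joblib is installed):

import joblib

joblib.dump(model, "my_model.joblib")
loaded_model = joblib.load("my_model.joblib")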

Loading a Model#

import pickle

model_name = "my_model.pkl"
with open(model_name, 'rb') as file: 
  pickled_model = pickle.load(file)
preds = pickled_model.predict(X_test)
predsp = pickled_model.predict_proba(X_test)
scores = pickled_model.score(X_test, y_test)

print("Model Score: ", scores)
print("Model prediction: \n", preds[:10])
print("Model prediction (proba): \n", predsp[:10])
Model Score:  1.0
Model prediction: 
 [0 2 2 2 2 1 1 2 0 1]
Model prediction (proba): 
 [[0.99985755 0.0000996  0.00004284]
 [0.00001745 0.00016558 0.99981698]
 [0.00004946 0.00008208 0.99986846]
 [0.00004331 0.0001275  0.99982919]
 [0.00002165 0.00026253 0.99971582]
 [0.00232173 0.99603898 0.00163929]
 [0.00024187 0.99971246 0.00004567]
 [0.00038217 0.00214574 0.99747209]
 [0.99954522 0.00037273 0.00008205]
 [0.00190224 0.99772551 0.00037225]]

Practical 2: Machine Learning Tasks#

Imports#

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_blobs, make_regression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import AgglomerativeClustering
from yellowbrick.contrib.classifier import DecisionViz
from yellowbrick.regressor import ResidualsPlot
from yellowbrick.regressor import PredictionError

# plt.rcParams["figure.figsize"] = (10,10)
np.set_printoptions(suppress=True)

ML Tasks#

There are several types of machine learning tasks; among the most important are classification, regression, and clustering.

Understanding which task is most appropriate for a given problem is one of the Data Scientist's jobs.

Supervised Learning#

This is the task of learning a function that maps an input to an output from example input-output pairs.

Classification#

In classification the output is discrete, for example:

  • Whether it will rain or be sunny;

  • Whether the diagnosis for a disease is positive or negative;

  • Whether a given bank transfer was fraudulent or not;

  • Whether a customer will prefer product A, B, or C.

X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=0)

df = pd.DataFrame(
    {
        'feature-1': X[:, 0],
        'feature-2': X[:, 1],
        'target': y
    }
)

features = ["feature-1", "feature-2"]
target = "target"

sns.scatterplot(data=df, x=features[0], y=features[1], hue=target);
[Figure: scatter plot of the three blobs, colored by class]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

list(map(lambda x: x.shape, [X_train, X_test, y_train, y_test]))
[(350, 2), (150, 2), (350,), (150,)]
clf = LogisticRegression(random_state=42)

viz = DecisionViz(
    clf,
    title="Decision Tree",
    features=features
)

viz.fit(X_train, y_train)
viz.draw(X_test, y_test)
viz.show()
[Figure: decision boundary of the Logistic Regression classifier]
<AxesSubplot:xlabel='feature-1', ylabel='feature-2'>
clf =  DecisionTreeClassifier(
    criterion="entropy",
    max_depth=5,
    min_samples_split=10,
    random_state=42
    )

viz = DecisionViz(
    clf,
    title="Decision Tree",
    features=features
)

viz.fit(X_train, y_train)
viz.draw(X_test, y_test)
viz.show()
[Figure: decision boundary of the Decision Tree classifier]
<AxesSubplot:xlabel='feature-1', ylabel='feature-2'>
clf = KNeighborsClassifier(
    n_neighbors=5,
    metric="euclidean"
    )

viz = DecisionViz(
    clf,
    title="Nearest Neighbors",
    features=features
)

viz.fit(X_train, y_train)
viz.draw(X_test, y_test)
viz.show()
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
[Figure: decision boundary of the k-Nearest Neighbors classifier]
<AxesSubplot:xlabel='feature-1', ylabel='feature-2'>

Regression#

In regression the output is continuous, for example:

  • The price of a stock next week;

  • Tomorrow's temperature;

  • The number of products sold during a campaign;

  • The price of a raw material over the coming months.

X, y = make_regression(
    n_samples=500,
    n_features=2,
    n_informative=2,
    noise=30,
    random_state=1
    )

df = pd.DataFrame(
    {
        'feature-1': X[:, 0],
        'feature-2': X[:, 1],
        'target': y
    }
)

features = ["feature-1", "feature-2"]
target = "target"

sns.scatterplot(data=df, x=features[0], y=features[1], hue=target)

sns.relplot(data=df, x=features[0], y=target)
sns.rugplot(data=df, x=features[0], y=target)

sns.relplot(data=df, x=features[1], y=target)
sns.rugplot(data=df, x=features[1], y=target)
<AxesSubplot:xlabel='feature-2', ylabel='target'>
[Figures: scatter of feature-1 vs. feature-2 colored by target; feature-1 vs. target; feature-2 vs. target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

list(map(lambda x: x.shape, [X_train, X_test, y_train, y_test]))
[(350, 2), (150, 2), (350,), (150,)]
model =  LinearRegression()

visualizer = PredictionError(model)
visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
visualizer.show()                 # Finalize and render the figure
[Figure: prediction error plot for LinearRegression]
<AxesSubplot:title={'center':'Prediction Error for LinearRegression'}, xlabel='$y$', ylabel='$\\hat{y}$'>

Unsupervised Learning#

No labels are known in advance, so the input data is used to discover hidden patterns and group the samples in some way.

Clustering#

Examples:

  • Customer segmentation

  • Recommendation engines

  • Social Network Analysis (SNA)

X, _ = make_blobs(n_samples=500, n_features=4, centers=3, random_state=1)

df = pd.DataFrame(
    {
        'feature-1': X[:, 0],
        'feature-2': X[:, 1],
    }
)

features = ["feature-1", "feature-2"]
target = "target"

sns.scatterplot(data=df, x=features[0], y=features[1])
plt.show()
[Figure: scatter plot of the blob data]
for c in [2, 3, 4]:
  cluster = AgglomerativeClustering(n_clusters=c, affinity='euclidean', linkage='ward')
  y_ = cluster.fit_predict(X)
  sns.scatterplot(data=df, x=features[0], y=features[1], hue = y_)
  plt.show()
[Figures: agglomerative clustering results for 2, 3, and 4 clusters]

© ealcobaca (2022)

Practical 3 - Case: Cancer Classification using microRNA#

Description & Objective#

Data Description: The data were collected from The Cancer Genome Atlas (TCGA), an international, world-reference program that has characterized more than 33 types of cancer. The data are real and were properly anonymized. Each row represents a sample taken from one person. The columns are microRNA types, and each entry represents the intensity with which that microRNA is expressed. Expression values range over \([0, \infty)\). Values close to zero indicate low expression, while large values indicate high expression. The data also include labels (see the class attribute): TP (primary solid tumor) indicates tumor tissue and NT (normal tissue) indicates healthy tissue.

Objective: Build a model to predict whether a person has cancer, given an RNA sequencing exam.

Reading the Data#

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.decomposition import PCA

# plt.rcParams["figure.figsize"] = (15,15)
# sns.set(rc={'figure.figsize':(150, 150)})
df = pd.read_csv("brca_mirnaseq.csv", sep=';', header=0, decimal=',')
df
hsa.let.7a.1 hsa.let.7a.2 hsa.let.7a.3 hsa.let.7b hsa.let.7c hsa.let.7d hsa.let.7e hsa.let.7f.1 hsa.let.7f.2 hsa.let.7g ... hsa.mir.941.1 hsa.mir.942 hsa.mir.943 hsa.mir.944 hsa.mir.95 hsa.mir.96 hsa.mir.98 hsa.mir.99a hsa.mir.99b class
0 8962.996542 17779.575039 9075.200383 24749.898857 341.298400 406.164781 1470.179650 14.716795 3627.642977 387.417272 ... 0.0 5.530515 0.187475 2.062226 4.124452 119.984057 53.992826 130.201449 46548.939810 TP
1 7739.739862 15524.941906 7713.626636 23374.640471 801.487258 513.297924 560.962427 20.922042 6557.093894 350.955461 ... 0.0 8.180047 0.000000 0.629234 1.258469 60.249189 86.047798 236.434808 12644.149725 TP
2 8260.612670 16497.981335 8355.342958 10957.355911 635.811272 620.351816 2694.331127 39.799878 11830.760394 600.725980 ... 0.0 3.618171 0.000000 0.767491 1.644623 97.252043 117.645369 191.434123 33083.456616 TP
3 9056.241254 18075.168478 9097.666150 26017.522731 2919.348415 334.245155 1322.434475 17.866463 6438.725384 354.957604 ... 0.0 3.478426 0.000000 3.478426 1.739213 72.572624 41.583007 1046.690127 24067.232290 TP
4 10897.303665 21822.338727 10963.956320 22204.253575 3313.009950 350.615669 1711.886682 22.541895 8246.117280 333.425447 ... 0.0 2.108235 0.000000 1.135203 0.810860 19.947145 34.380445 1081.037952 25715.275426 TP
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
837 10628.975280 21125.108661 10585.686678 23396.813364 3892.051211 367.141461 1484.663795 23.402901 10570.535667 571.680109 ... 0.0 1.217492 0.000000 0.405831 1.217492 79.813361 57.627952 1100.883277 16338.471420 TP
838 16799.785282 33603.904432 16883.338223 20731.006597 5263.331356 201.676038 2173.283559 36.888271 18227.341203 870.301142 ... 0.0 5.341744 0.000000 3.124416 2.318115 16.629958 57.348159 1919.601107 14080.736733 TP
839 13120.807001 26337.935723 13229.425112 18796.895124 6581.549565 375.598820 2547.029500 28.505268 16838.042944 778.398745 ... 0.0 1.863089 0.000000 0.558927 0.931545 41.919511 54.215901 1310.124456 17072.605898 TP
840 7979.531224 16006.280243 8106.687917 20462.010937 4040.296936 295.594442 962.166120 23.885025 7625.121634 428.411748 ... 0.0 2.070956 0.000000 2.209020 1.656765 55.225491 53.016472 1120.939408 18696.866174 TP
841 10439.110392 20880.967721 10649.126224 17770.685685 1330.766196 790.868182 1952.822603 29.966587 10936.555740 577.855691 ... 0.0 11.487192 0.000000 3.745823 2.746937 44.949881 80.160621 470.225698 34080.000799 TP

842 rows × 898 columns

df.shape
(842, 898)

Exploratory Data Analysis#

ax = sns.countplot(x="class", data=df)
[Figure: count plot of the class column (TP vs. NT)]
df["class"].value_counts()
TP    755
NT     87
Name: class, dtype: int64
df["class"].value_counts(normalize=True)
TP    0.896675
NT    0.103325
Name: class, dtype: float64
df.describe()
hsa.let.7a.1 hsa.let.7a.2 hsa.let.7a.3 hsa.let.7b hsa.let.7c hsa.let.7d hsa.let.7e hsa.let.7f.1 hsa.let.7f.2 hsa.let.7g ... hsa.mir.940 hsa.mir.941.1 hsa.mir.942 hsa.mir.943 hsa.mir.944 hsa.mir.95 hsa.mir.96 hsa.mir.98 hsa.mir.99a hsa.mir.99b
count 842.000000 842.000000 842.000000 842.000000 842.000000 842.000000 842.000000 842.000000 842.000000 842.000000 ... 842.000000 842.000000 842.000000 842.000000 842.000000 842.000000 842.000000 842.000000 842.000000 842.000000
mean 9218.938921 18432.504585 9289.250466 26606.604836 3152.699471 558.321269 1289.570177 24.359962 8687.461926 610.223836 ... 5.902975 0.003737 6.446279 0.061018 2.320737 3.150482 38.307053 63.746405 1034.572148 44369.112203
std 4843.796136 9704.187427 4858.691217 16745.347957 3238.003201 346.883205 763.056055 12.490091 6052.615278 317.854963 ... 8.325681 0.049274 9.541682 0.172214 6.527536 4.287594 33.791795 40.145314 1117.491608 32754.290751
min 1294.149164 2599.981125 1319.952907 1817.920354 148.795934 79.783216 161.181457 2.439034 653.474578 88.614573 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.374223 18.400719 3475.079227
25% 5902.143848 11741.467528 5933.706564 14580.357100 1276.700850 330.638301 809.867504 16.441786 4648.822942 410.859815 ... 1.378098 0.000000 2.464140 0.000000 0.373238 1.201951 14.906921 39.913493 387.430475 22769.094433
50% 8016.628565 16040.589880 8103.783439 23097.825936 2352.902327 481.342371 1101.403395 21.890340 7019.157941 532.277053 ... 3.192098 0.000000 4.127957 0.000000 1.036215 2.235731 29.634884 52.993693 710.026124 35594.670263
75% 11236.887034 22538.594950 11289.595988 34373.185504 3971.192192 681.931022 1619.864372 29.395515 10926.448322 724.277709 ... 7.159431 0.000000 7.551755 0.000000 2.345941 4.030888 51.258145 75.993914 1242.434228 53462.034662
max 45101.697434 90233.655610 45095.490102 144706.427973 59677.212349 3370.036117 11617.011618 121.408006 80780.055188 3342.745045 ... 91.996543 0.909391 184.185656 1.757516 122.685820 93.402785 259.127121 399.078716 15689.499524 248074.178531

8 rows × 897 columns

Establishing a Comparative Baseline#

Before any modeling, let us establish a baseline, i.e., a simple solution to the problem. The point of a baseline is to guide the experiment: it lets us check whether all the techniques we apply are really necessary and actually improve the solution. Without a baseline, the experiment proceeds blindly, and we cannot tell whether we are on the right track.

X = df.drop("class", axis=1)
y = df["class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=y, random_state=42)
y_train.value_counts(normalize=True)
TP    0.896435
NT    0.103565
Name: class, dtype: float64
y_test.value_counts(normalize=True)
TP    0.897233
NT    0.102767
Name: class, dtype: float64
lrc = LogisticRegression(random_state=42)

cv_list_lr_baseline = cross_val_score(
    lrc,
    X_train,
    y_train,
    cv=10,
    scoring="balanced_accuracy"
)
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
(the same warning is repeated for each of the 10 CV folds; the pipelines below avoid it by scaling the data)
mean_cv_lr_baseline = np.mean(cv_list_lr_baseline)
std_cv_lr_baseline = np.std(cv_list_lr_baseline)

print(f"Preformance (bac): {round(mean_cv_lr_baseline, 4)} +- {round(std_cv_lr_baseline, 4)}")
Preformance (bac): 0.9201 +- 0.046

Modeling#

knn = Pipeline(
    [
     ('mms', MinMaxScaler()),
     ('skb', SelectKBest(chi2, k=10)),  # chi2 requires non-negative features, hence the MinMaxScaler above
     ('knn', KNeighborsClassifier(
         n_neighbors=3, # number of neighbors
         p=2, # Minkowski metric parameter; p=2 is the Euclidean distance
         weights="uniform", # weight given to each example
         )
     )
  ]
)


cv_list_knn_euclid = cross_val_score(
    knn,
    X_train,
    y_train,
    cv=10,
    scoring="balanced_accuracy"
)

mean_cv_knn_euclid = np.mean(cv_list_knn_euclid)
std_cv_knn_euclid = np.std(cv_list_knn_euclid)

print(f"Preformance (bac): {round(mean_cv_knn_euclid, 4)} +- {round(std_cv_knn_euclid, 4)}")
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
(the same warning is repeated for each CV fold)
Performance (bac): 0.9703 +- 0.0377
knn = Pipeline(
    [
     ('mms', MinMaxScaler()),
     ('skb', SelectKBest(chi2, k=10)),
     ('knn', KNeighborsClassifier(
         n_neighbors=3, # number of neighbors
         p=1, # Minkowski metric parameter; p=1 is the Manhattan distance
         weights="uniform", # weight given to each example
         )
     )
  ]
)


cv_list_knn_man = cross_val_score(
    knn,
    X_train,
    y_train,
    cv=10,
    scoring="balanced_accuracy"
)

mean_cv_knn_man = np.mean(cv_list_knn_man)
std_cv_knn_man = np.std(cv_list_knn_man)

print(f"Preformance (bac): {round(mean_cv_knn_man, 4)} +- {round(std_cv_knn_man, 4)}")
C:\Users\mlfernandez\Anaconda3\lib\site-packages\sklearn\neighbors\_classification.py:228: FutureWarning: Unlike other reduction functions (e.g. `skew`, `kurtosis`), the default behavior of `mode` typically preserves the axis it acts along. In SciPy 1.11.0, this behavior will change: the default value of `keepdims` will become False, the `axis` over which the statistic is taken will be eliminated, and the value None will no longer be accepted. Set `keepdims` to True or False to avoid this warning.
  mode, _ = stats.mode(_y[neigh_ind, k], axis=1)
(the same warning is repeated for each CV fold)
Performance (bac): 0.9638 +- 0.0407
lr = Pipeline(
    [
     ('scaler', StandardScaler()),
     ('lr', LogisticRegression(
         penalty="l2", # penalidade, usado para evitar overfitting
         C=1, # força de regularização do modelo. Valores pequenos implicam em regularização mais forte
         fit_intercept=True, # bias ou intercepto do modelo
         class_weight="balanced", # peso das classes. Útil para datasets desbalanceados
         random_state=42
         )
     )
     ])


cv_list_lr_l2 = cross_val_score(
    lr,
    X_train,
    y_train,
    cv=10,
    scoring="balanced_accuracy"
)

mean_cv_lr_l2 = np.mean(cv_list_lr_l2)
std_cv_lr_l2 = np.std(cv_list_lr_l2)

print(f"Preformance (bac): {round(mean_cv_lr_l2, 4)} +- {round(std_cv_lr_l2, 4)}")
Preformance (bac): 0.9655 +- 0.0391
lr = Pipeline(
    [
     ('scaler', StandardScaler()),
     ('lr', LogisticRegression(
         penalty="l1", # penalidade, usado para evitar overfitting
         C=1, # força de regularização do modelo. Valores pequenos implicam em regularização mais forte
         fit_intercept=True, # bias ou intercepto do modelo
         class_weight="balanced", # peso das classes. Útil para datasets desbalanceados
         solver="liblinear",
         random_state=42
         )
     )
     ])


cv_list_lr_l1 = cross_val_score(
    lr,
    X_train,
    y_train,
    cv=10,
    scoring="balanced_accuracy"
)

mean_cv_lr_l1 = np.mean(cv_list_lr_l1)
std_cv_lr_l1 = np.std(cv_list_lr_l1)

print(f"Preformance (bac): {round(mean_cv_lr_l1, 4)} +- {round(std_cv_lr_l1, 4)}")
Preformance (bac): 0.9665 +- 0.0373
lr = Pipeline(
    [
     ('scaler', StandardScaler()),
     ('pca', PCA(n_components=10)),
     ('lr', LogisticRegression(
         penalty="l2", # penalidade, usado para evitar overfitting
         C=1, # força de regularização do modelo. Valores pequenos implicam em regularização mais forte
         fit_intercept=True, # bias ou intercepto do modelo
         class_weight="balanced", # peso das classes. Útil para datasets desbalanceados
         random_state=42
         )
     )
     ])


cv_list_lr_pca = cross_val_score(
    lr,
    X_train,
    y_train,
    cv=10,
    scoring="balanced_accuracy"
)

mean_cv_lr_pca = np.mean(cv_list_lr_pca)
std_cv_lr_pca = np.std(cv_list_lr_pca)

print(f"Preformance (bac): {round(mean_cv_lr_pca, 4)} +- {round(std_cv_lr_pca, 4)}")
Preformance (bac): 0.9822 +- 0.0228
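
It helps to see why PCA works here. A minimal sketch, assuming the X_train from this practical and the PCA/StandardScaler imports already used above, that checks how much variance the 10 retained components keep (the exact ratios depend on the data):

# Sketch: cumulative variance explained by the 10 principal components
pca_probe = PCA(n_components=10)
pca_probe.fit(StandardScaler().fit_transform(X_train))
print("Cumulative explained variance:", pca_probe.explained_variance_ratio_.cumsum().round(4))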

Experimental Evaluation#

# cross-validation results

df_result_cv = pd.DataFrame(
    [cv_list_lr_baseline, cv_list_knn_euclid, cv_list_knn_man, cv_list_lr_l2, cv_list_lr_l1, cv_list_lr_pca],
    index=["baseline", "kNN-eucli", "kNN-man","LR-L2", "LR-L1", "LR-PCA"]
).T

df_result_cv
baseline kNN-eucli kNN-man LR-L2 LR-L1 LR-PCA
0 0.907233 1.000000 0.916667 0.990566 0.990566 0.990566
1 0.990566 0.981132 0.990566 0.888365 0.981132 0.981132
2 0.971698 0.990566 0.990566 0.990566 0.990566 0.990566
3 0.907233 0.916667 0.916667 0.916667 0.907233 0.990566
4 0.907233 1.000000 1.000000 0.990566 1.000000 1.000000
5 0.916667 0.916667 0.916667 0.916667 0.916667 0.916667
6 0.907233 0.907233 0.907233 0.981132 0.907233 0.990566
7 0.878931 0.990566 1.000000 0.990566 0.981132 0.981132
8 0.980769 1.000000 1.000000 0.990385 0.990385 0.980769
9 0.833333 1.000000 1.000000 1.000000 1.000000 1.000000
df_res = df_result_cv.stack().to_frame("balanced_accuracy")
df_res.index.rename(["fold", "pipelines"], inplace=True)
df_res = df_res.reset_index()
df_res.head(12)
fold pipelines balanced_accuracy
0 0 baseline 0.907233
1 0 kNN-eucli 1.000000
2 0 kNN-man 0.916667
3 0 LR-L2 0.990566
4 0 LR-L1 0.990566
5 0 LR-PCA 0.990566
6 1 baseline 0.990566
7 1 kNN-eucli 0.981132
8 1 kNN-man 0.990566
9 1 LR-L2 0.888365
10 1 LR-L1 0.981132
11 1 LR-PCA 0.981132
plt.figure(figsize=(10,10))

ax = sns.boxplot(x="pipelines", y="balanced_accuracy", data=df_res)
ax = sns.swarmplot(x="pipelines", y="balanced_accuracy", data=df_res, color=".40")
../_images/246b452e7a72123da8f4f1ae419c3fd4bc3b342190fb1e125148e80422baa586.png
plt.figure(figsize=(10,10))

sns.catplot(x="pipelines", y="balanced_accuracy", kind="violin", data=df_res)
<seaborn.axisgrid.FacetGrid at 0x7f218c1d0fd0>
<Figure size 720x720 with 0 Axes>
../_images/155f482e6675a6e786857d80296e05de8723b0130ac6ab0286491e6ddcf7f36d.png

The model we will select is Logistic Regression with PCA, since it achieved a competitive mean/median and the smallest standard deviation (made explicit in the aggregation sketch below).
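
To make this criterion explicit, we can aggregate the cross-validation scores per pipeline; a minimal sketch over the df_res DataFrame built above:

# Sketch: mean, median and standard deviation of the balanced accuracy per pipeline
df_res.groupby("pipelines")["balanced_accuracy"].agg(["mean", "median", "std"])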

Finally, let's evaluate the final performance of our model:

# retrain the selected pipeline on all the training data

lr = Pipeline(
    [
     ('scaler', StandardScaler()),
     ('pca', PCA(n_components=10)),
     ('lr', LogisticRegression(
         penalty="l2",
         C=1,
         fit_intercept=True,
         class_weight="balanced",
         random_state=42
         )
     )
     ])


lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
lr_pca_test = balanced_accuracy_score(y_test, y_pred) 

print("Performance: ", round(lr_pca_test, 4))
Performance:  0.972
# Confusion matrix

from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(lr, X_test, y_test)  
plt.show()
../_images/f867672d8c8ae2ca2fb8d798753960499d96562a29215113b08391bc2a5e87b2.png
ConfusionMatrixDisplay.from_estimator(lr, X_test, y_test, normalize='true')  
plt.show()
../_images/d1b4bb95ea26b5af29a3fd5a2f9bc65ce58e4e79b8a3e5415fee3b2bcb8bb9ac.png

Referências & Links#

  1. The Cancer Genome Atlas Program

  2. Micro RNA

  3. Sklearn Pipeline

Practical 4: Drug classification using a decision tree#

Descrição & Objetivo#

Data description: A pharmaceutical company developed and tested two different drugs for treating a disease. Researchers noticed that drug A worked better for some patients, while drug B worked better for another group. The following patient characteristics were collected: age (Age), sex (Sex), blood pressure (BP) and cholesterol level (Cholesterol). The team brought you in to build an automatic solution that recommends the best drug. However, since this is a medication, the recommendation must be transparent, i.e., the patient needs to understand exactly why it was made.

Objective: Build clear, well-defined rules to recommend the best drug given a patient's characteristics.

1. Reading the Data#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


from sklearn.model_selection import cross_val_score
from sklearn.tree import plot_tree, DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import balanced_accuracy_score
# Read the data
data = pd.read_csv("drug_data.csv")
data
Age Sex BP Cholesterol Drug
0 23 F HIGH HIGH B
1 28 F NORMAL HIGH A
2 61 F LOW HIGH B
3 22 F NORMAL HIGH A
4 49 F NORMAL HIGH B
... ... ... ... ... ...
140 72 M LOW HIGH B
141 46 F HIGH HIGH B
142 52 M NORMAL HIGH A
143 23 M NORMAL NORMAL A
144 40 F LOW NORMAL A

145 rows × 5 columns

2. Exploratory Data Analysis#

# Count the classes
data["Drug"].value_counts()
B    91
A    54
Name: Drug, dtype: int64
# Fraction of examples per class
data["Drug"].value_counts(True)
B    0.627586
A    0.372414
Name: Drug, dtype: float64
# Check for NaN values
data.isna().sum()
Age            0
Sex            0
BP             0
Cholesterol    0
Drug           0
dtype: int64
# Describe the categorical data
data.describe(include=[object])
Sex BP Cholesterol Drug
count 145 145 145 145
unique 2 3 2 2
top F NORMAL NORMAL B
freq 74 59 78 91
# Describe the numerical data
data.describe(include=[np.number])
Age
count 145.000000
mean 43.848276
std 16.755319
min 15.000000
25% 30.000000
50% 43.000000
75% 58.000000
max 74.000000

3. Modelagem & Avaliação#

3.1 Baseline#

# Split the data into training and test sets

categorical_features = ["Sex", "BP", "Cholesterol"]
numerical_features = ["Age"]

X = data.drop(columns=["Drug"])
y = data["Drug"].astype('category').cat.codes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
lrc = LogisticRegression(random_state=42)

cv_list_lr_baseline = cross_val_score(
    lrc,
    X_train[numerical_features],
    y_train,
    cv=10,
    scoring="balanced_accuracy"
)
mean_cv_lr_baseline = np.mean(cv_list_lr_baseline)
std_cv_lr_baseline = np.std(cv_list_lr_baseline)

print(f"Performance (bac): {round(mean_cv_lr_baseline, 4)} +- {round(std_cv_lr_baseline, 4)}")
Performance (bac): 0.5 +- 0.0

3.2 Handling Categorical Data#

Although decision tree algorithms were originally designed to handle categorical data, the sklearn implementations only work with numerical data. We therefore need to convert categorical attributes into numbers using, for example, the OneHotEncoder or the OrdinalEncoder (a OneHotEncoder sketch follows the OrdinalEncoder demo below).

# This converts the categories into numbers
oe = OrdinalEncoder()
oe.fit(data[["Sex", "BP"]])
oe.transform(data[["Sex", "BP"]])[:10]
array([[0., 0.],
       [0., 2.],
       [0., 1.],
       [0., 2.],
       [0., 2.],
       [1., 2.],
       [1., 1.],
       [0., 0.],
       [1., 1.],
       [0., 1.]])
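For comparison, a minimal sketch of the OneHotEncoder alternative mentioned above, which creates one binary column per category instead of a single integer code (the sparse_output argument assumes sklearn >= 1.2; older versions use sparse=False instead):

from sklearn.preprocessing import OneHotEncoder

# Sketch: one-hot encode the same two columns
ohe = OneHotEncoder(sparse_output=False)
ohe.fit(data[["Sex", "BP"]])
print(ohe.get_feature_names_out())
print(ohe.transform(data[["Sex", "BP"]])[:5])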
# We can transform each group of features differently
# using the ColumnTransformer

tr = ColumnTransformer(
    transformers=[
        ("cat-enc", OrdinalEncoder(), categorical_features),
        # ("min-max", MinMaxScaler(), numerical_features)
    ]
)

tr.fit_transform(X_train)[:10]
array([[0., 2., 1.],
       [1., 1., 1.],
       [0., 1., 1.],
       [0., 0., 1.],
       [1., 1., 1.],
       [1., 2., 1.],
       [0., 1., 1.],
       [0., 0., 0.],
       [0., 2., 0.],
       [0., 0., 0.]])
# We can transform each group of features differently
# using the ColumnTransformer

tr = ColumnTransformer(
    transformers=[
        ("cat-enc", OrdinalEncoder(), categorical_features),
        ("min-max", MinMaxScaler(), numerical_features)
    ]
)

tr.fit_transform(X_train)[:10]
array([[0.        , 2.        , 1.        , 0.08474576],
       [1.        , 1.        , 1.        , 0.50847458],
       [0.        , 1.        , 1.        , 0.61016949],
       [0.        , 0.        , 1.        , 0.06779661],
       [1.        , 1.        , 1.        , 0.57627119],
       [1.        , 2.        , 1.        , 0.13559322],
       [0.        , 1.        , 1.        , 0.08474576],
       [0.        , 0.        , 0.        , 0.72881356],
       [0.        , 2.        , 0.        , 0.42372881],
       [0.        , 0.        , 0.        , 0.45762712]])
# We can transform each group of features differently
# using the ColumnTransformer

tr = ColumnTransformer(
    transformers=[
        ("cat-enc", OrdinalEncoder(), categorical_features),
        # ("min-max", MinMaxScaler(), numerical_features)
    ],
    remainder='passthrough' # pass the remaining features through unchanged
)

tr.fit_transform(X_train)[:10]
array([[ 0.,  2.,  1., 20.],
       [ 1.,  1.,  1., 45.],
       [ 0.,  1.,  1., 51.],
       [ 0.,  0.,  1., 19.],
       [ 1.,  1.,  1., 49.],
       [ 1.,  2.,  1., 23.],
       [ 0.,  1.,  1., 20.],
       [ 0.,  0.,  0., 58.],
       [ 0.,  2.,  0., 40.],
       [ 0.,  0.,  0., 42.]])

3.3 Modeling a decision tree#

For this we can use the DecisionTreeClassifier class, which builds decision tree (DT) models.

Note that you need to configure the parameters that control the tree size to avoid overfitting. Very deep trees also become hard to interpret, which defeats the purpose of a transparent model. See the note in the algorithm's documentation, and the size-inspection sketch after it:

Notes

The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.
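
A minimal sketch for diagnosing tree size after fitting, assuming the tr transformer defined in the cells above: a fitted DecisionTreeClassifier exposes get_depth() and get_n_leaves().

# Sketch: fit an unconstrained tree and inspect its size
dt_probe = DecisionTreeClassifier(random_state=42)
dt_probe.fit(tr.fit_transform(X_train), y_train)
print("depth:", dt_probe.get_depth(), "| leaves:", dt_probe.get_n_leaves())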

Without limiting the tree size#

categorical_features = ["Sex", "BP", "Cholesterol"]

ct = ColumnTransformer(
    transformers=[
        ("cat", OrdinalEncoder(), categorical_features),
    ],
    remainder='passthrough' # pass the remaining features through unchanged
)

dt = DecisionTreeClassifier(
    criterion="gini",        # criterion used to measure split quality
    class_weight="balanced", # give both classes the same effective weight
    random_state=42
 )

pipe1 = Pipeline([
    ('preprocessing-1', ct),
    ('model', dt)
])
cv_list_dt = cross_val_score(
    pipe1,
    X_train,
    y_train,
    cv=10,
    scoring="balanced_accuracy"
)
mean_cv_dt = np.mean(cv_list_dt)
std_cv_dt = np.std(cv_list_dt)

print(f"Performance (bac): {round(mean_cv_dt, 4)} +- {round(std_cv_dt, 4)}")
Performance (bac): 0.6643 +- 0.1602
pipe1.fit(X_train, y_train)

model = pipe1["model"]
X_train_prep = pipe1["preprocessing-1"].transform(X_train)
feature_names = pipe1["preprocessing-1"].transformers_[0][2] + X_train.columns[pipe1["preprocessing-1"].transformers_[1][2]].to_list()
class_names = ["A", "B"]

plt.figure(figsize=(15,10))
plot_tree(
    model.fit(X_train_prep, y_train),
    feature_names=feature_names,
    class_names=class_names,
    filled=True,
    proportion=True
)
plt.show()
../_images/dbdd5fedb1aaef443ded0e8da57766d2ebaaaadfb5eda63c502dcd30581f22a2.png

Limiting the tree size#

categorical_features = ["Sex", "BP", "Cholesterol"]

ct = ColumnTransformer(
    transformers=[
        ("cat", OrdinalEncoder(), categorical_features),
    ],
    remainder='passthrough' # pass the remaining features through unchanged
)

dt = DecisionTreeClassifier(
    criterion="gini",        # criterion used to measure split quality
    max_depth=3,             # maximum depth of the tree
    min_samples_leaf=0.01,   # minimum number of samples required at a leaf node, given here as a fraction (1%)
    class_weight="balanced", # give both classes the same effective weight
    random_state=42
 )

pipe2 = Pipeline([
    ('preprocessing-1', ct),
    ('model', dt)
])
cv_list_dt2 = cross_val_score(
    pipe2,
    X_train,
    y_train,
    cv=10,
    scoring="balanced_accuracy"
)
mean_cv_dt2 = np.mean(cv_list_dt2)
std_cv_dt2 = np.std(cv_list_dt2)

print(f"Performance (bac): {round(mean_cv_dt2, 4)} +- {round(std_cv_dt2, 4)}")
Performance (bac): 0.7845 +- 0.1012
pipe2.fit(X_train, y_train)

model = pipe2["model"]
X_train_prep = pipe2["preprocessing-1"].transform(X_train)
feature_names = pipe2["preprocessing-1"].transformers_[0][2] + X_train.columns[pipe2["preprocessing-1"].transformers_[1][2]].to_list()
class_names = ["A", "B"]

plt.figure(figsize=(15,10))
plot_tree(
    model.fit(X_train_prep, y_train),
    feature_names=feature_names,
    class_names=class_names,
    filled=True,
    proportion=True
)
plt.show()
../_images/3dbb1419dc1e301f1e242c5d557f6c1c704d4853e63aba51155027c1622d6880.png
# retrain the selected pipeline on all the training data

pipe2.fit(X_train, y_train)
y_pred = pipe2.predict(X_test)
dt_test = balanced_accuracy_score(y_test, y_pred)

print("Performance: ", round(dt_test, 4))
Performance:  0.8036

4. DTreeViz (Decision Tree Visualization)#

# run the command below to install dtreeviz, if needed
!pip install dtreeviz
from sklearn.datasets import load_iris
import dtreeviz
import logging

logging.getLogger('matplotlib.font_manager').disabled = True

classifier = DecisionTreeClassifier(max_depth=3)
iris = load_iris()
classifier.fit(iris.data, iris.target)

viz = dtreeviz.model(classifier,
              iris.data,
              iris.target,
              target_name='variety',
              feature_names=iris.feature_names,
              class_names=["setosa", "versicolor", "virginica"]
              )

viz.view(scale=1.4)
viz.view(x=iris.data[100], scale=1.4)
../_images/1e507084f0b78053da3c75c3de49864cbce91b402f0d2399161878a6fa4f9ed0.svg

Referências & Links#

  1. Decision Trees

  2. Column Transformer

  3. IBM drug data

  4. Classification And Regression Trees

  5. dtreeviz

Practical 5a: Predicting credit product churn with the Support Vector Machine algorithm#

Descrição & Objetivo#

Data description: A company has a well-known credit card product. However, given strong competition, the company noticed that its customers were starting to abandon the product. The company hired you to model which customers are likely to leave, so it can intervene proactively and avoid losing them. The available data contains several customer characteristics (e.g., sex, age, marital status, education) and credit product usage features (e.g., card limit, card type). The Attrition_Flag column indicates whether the customer abandoned the product.

Objective: Model the problem with an SVM to classify which customers are likely to abandon the product.

1. Reading the Data#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import plot_tree, DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import balanced_accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# Read the data
data = pd.read_csv("credit_card_churn_data.csv")

target = 'Attrition_Flag'

categorical_features = ['Gender',  'Education_Level',
                        'Marital_Status', 'Income_Category', 'Card_Category'
                        ]

numerical_features = ['Dependent_count', 'Customer_Age',  'Months_on_book',
                      'Total_Relationship_Count', 'Months_Inactive_12_mon',
                      'Contacts_Count_12_mon', 'Credit_Limit',
                      'Total_Revolving_Bal', 'Avg_Open_To_Buy',
                      'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
                      'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1',
                      'Avg_Utilization_Ratio'
                      ]
data
Dependent_count Customer_Age Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio Gender Education_Level Marital_Status Income_Category Card_Category Attrition_Flag
0 3 45 39 5 1 3 12691.0 777 11914.0 1.335 1144 42 1.625 0.061 M High School Married $60K - $80K Blue Existing Customer
1 5 49 44 6 1 2 8256.0 864 7392.0 1.541 1291 33 3.714 0.105 F Graduate Single Less than $40K Blue Existing Customer
2 3 51 36 4 1 0 3418.0 0 3418.0 2.594 1887 20 2.333 0.000 M Graduate Married $80K - $120K Blue Existing Customer
3 4 40 34 3 4 1 3313.0 2517 796.0 1.405 1171 20 2.333 0.760 F High School Unknown Less than $40K Blue Existing Customer
4 3 40 21 5 1 0 4716.0 0 4716.0 2.175 816 28 2.500 0.000 M Uneducated Married $60K - $80K Blue Existing Customer
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10122 2 50 40 3 2 3 4003.0 1851 2152.0 0.703 15476 117 0.857 0.462 M Graduate Single $40K - $60K Blue Existing Customer
10123 2 41 25 4 2 3 4277.0 2186 2091.0 0.804 8764 69 0.683 0.511 M Unknown Divorced $40K - $60K Blue Attrited Customer
10124 1 44 36 5 3 4 5409.0 0 5409.0 0.819 10291 60 0.818 0.000 F High School Married Less than $40K Blue Attrited Customer
10125 2 30 36 4 3 3 5281.0 0 5281.0 0.535 8395 62 0.722 0.000 M Graduate Unknown $40K - $60K Blue Attrited Customer
10126 2 43 25 6 2 4 10388.0 1961 8427.0 0.703 10294 61 0.649 0.189 F Graduate Married Less than $40K Silver Attrited Customer

10127 rows × 20 columns

2. Exploratory Data Analysis#

# Number of examples and features
data.shape
(10127, 20)
# Count the classes
data[target].value_counts()
Attrition_Flag
Existing Customer    8500
Attrited Customer    1627
Name: count, dtype: int64
# Fraction of examples per class
data[target].value_counts(normalize=True)
Attrition_Flag
Existing Customer    0.83934
Attrited Customer    0.16066
Name: proportion, dtype: float64
# Check for NaN values
data.isna().sum()
Dependent_count             0
Customer_Age                0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
Gender                      0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Attrition_Flag              0
dtype: int64
# Describe the categorical data
data.describe(include=[object])
Gender Education_Level Marital_Status Income_Category Card_Category Attrition_Flag
count 10127 10127 10127 10127 10127 10127
unique 2 7 4 6 4 2
top F Graduate Married Less than $40K Blue Existing Customer
freq 5358 3128 4687 3561 9436 8500
# Describe the numerical data
data.describe(include=[np.number])
Dependent_count Customer_Age Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
count 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000
mean 2.346203 46.325960 35.928409 3.812580 2.341167 2.455317 8631.953698 1162.814061 7469.139637 0.759941 4404.086304 64.858695 0.712222 0.274894
std 1.298908 8.016814 7.986416 1.554408 1.010622 1.106225 9088.776650 814.987335 9090.685324 0.219207 3397.129254 23.472570 0.238086 0.275691
min 0.000000 26.000000 13.000000 1.000000 0.000000 0.000000 1438.300000 0.000000 3.000000 0.000000 510.000000 10.000000 0.000000 0.000000
25% 1.000000 41.000000 31.000000 3.000000 2.000000 2.000000 2555.000000 359.000000 1324.500000 0.631000 2155.500000 45.000000 0.582000 0.023000
50% 2.000000 46.000000 36.000000 4.000000 2.000000 2.000000 4549.000000 1276.000000 3474.000000 0.736000 3899.000000 67.000000 0.702000 0.176000
75% 3.000000 52.000000 40.000000 5.000000 3.000000 3.000000 11067.500000 1784.000000 9859.000000 0.859000 4741.000000 81.000000 0.818000 0.503000
max 5.000000 73.000000 56.000000 6.000000 6.000000 6.000000 34516.000000 2517.000000 34516.000000 3.397000 18484.000000 139.000000 3.714000 0.999000

3. Modelagem & Avaliação#

3.1 Baseline#

# Split the data into training and test sets

X = data.drop(columns=[target])
y = data[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
# Transform categorical features via one-hot encoding
ct = ColumnTransformer( 
    transformers=[
        ("ohe", OneHotEncoder(), categorical_features),
    ],
    remainder='passthrough'
)

# z-score scaling of the data
scaler = StandardScaler()

# baseline model
lrc = LogisticRegression(random_state=42)

# machine learning pipeline
pipeb = Pipeline([
    ('column-transformer', ct),
    ('scaler', scaler),
    ('model', lrc)
])

# cross-validate the solution
cv_list_lr_baseline = cross_val_score(
    pipeb,
    X_train,
    y_train,
    cv=10,
    scoring="balanced_accuracy"
)
mean_cv_lr_baseline = np.mean(cv_list_lr_baseline)
std_cv_lr_baseline = np.std(cv_list_lr_baseline)

print(f"Performance (bac): {round(mean_cv_lr_baseline, 4)} +- {round(std_cv_lr_baseline, 4)}")
Performance (bac): 0.7834 +- 0.0227

3.2 Modeling with SVM#

n_samples = 1500
X_, Y_ = datasets.make_circles(n_samples=n_samples, factor=0.5, noise=0.05)

Y_ = ["c1" if i == 0 else "c2" for i in Y_]

aux_data = pd.DataFrame({"x": X_[:, 0], "y": X_[:, 1], "color": Y_})

fig = px.scatter(aux_data, x="x", y="y", color="color", opacity=0.5)
fig.update_layout(autosize=False, width=600, height=600)
fig.show()

Mapping the data through a second-degree polynomial feature map:

\( k(x, y) = \begin{pmatrix} x^2\\ \sqrt{2}xy\\ y^2\\ \end{pmatrix} \)

def k(x, y):
  return x**2, np.sqrt(2)*x*y, y**2

x, y, z = k(X_[:,0], X_[:, 1])
aux_data = pd.DataFrame({"x": x, "y": y, "z": z, "color": Y_})

fig = px.scatter_3d(aux_data, x="x", y="y", z="z", color="color", opacity=0.5)
fig.update_layout(autosize=False, width=600, height=600)
fig.show()

Linear Kernel#

The linear kernel is defined in sklearn as:

\(\left \langle x, x' \right \rangle\)
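
That is, the linear-kernel Gram matrix is simply the matrix of pairwise dot products between samples. A minimal sketch with a toy matrix of our own (not from the dataset):

import numpy as np
from sklearn.metrics.pairwise import linear_kernel

A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.allclose(linear_kernel(A), A @ A.T))  # True: Gram matrix equals pairwise dot products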

# Transform categorical features via one-hot encoding
ct = ColumnTransformer( 
    transformers=[
        ("ohe", OneHotEncoder(), categorical_features),
    ],
    remainder='passthrough'
)

# min-max scaling of the data
scaler = MinMaxScaler()

# SVM model
svc = SVC(
    C=1.0, # C is the regularization hyperparameter.
           # It controls the trade-off between a smooth decision boundary and
           # classifying the training points correctly.
           # Increasing C can lead to overfitting the training data.
    kernel="linear", # kernel type
    class_weight="balanced", # give both classes the same effective weight
    max_iter=100000, # maximum number of optimizer iterations
    random_state=42
    )

# machine learning pipeline
pipe1 = Pipeline([
    ('column-transformer', ct),
    ('scaler', scaler),
    ('model', svc)
])

# cross-validate the solution
cv_list_pipe1_baseline = cross_val_score(
    pipe1,
    X_train,
    y_train,
    cv=10,
    scoring="balanced_accuracy"
)

mean_cv_pipe1_baseline = np.mean(cv_list_pipe1_baseline)
std_cv_pipe1_baseline = np.std(cv_list_pipe1_baseline)

print(f"Performance (bac): {round(mean_cv_pipe1_baseline, 4)} +- {round(std_cv_pipe1_baseline, 4)}")
Performance (bac): 0.855 +- 0.0119

Polynomial Kernel#

The polynomial kernel is defined in sklearn as:

\(\left ( \gamma \left \langle x, x' \right \rangle + r \right )^d\)

where \(d\) is set by the degree hyperparameter, \(r\) by coef0, and \(\gamma\) by gamma.
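
For degree=2, gamma=1 and coef0=0, this kernel reproduces exactly the dot product of the explicit degree-2 map k(x, y) shown earlier, which is the point of the kernel trick: the mapped vectors are never built. A minimal numeric check with toy vectors of our own:

import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

u, v = np.array([1.0, 2.0]), np.array([3.0, 4.0])
phi = lambda p: np.array([p[0]**2, np.sqrt(2)*p[0]*p[1], p[1]**2])  # explicit degree-2 map
kernel_val = polynomial_kernel(u.reshape(1, -1), v.reshape(1, -1), degree=2, gamma=1, coef0=0)[0, 0]
print(np.isclose(kernel_val, phi(u) @ phi(v)))  # True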

# Transform categorical features via one-hot encoding
ct = ColumnTransformer( 
    transformers=[
        ("ohe", OneHotEncoder(), categorical_features),
    ],
    remainder='passthrough'
)

# min-max scaling of the data
scaler = MinMaxScaler()

# SVM model
svc = SVC(
    C=1.0, # C is the regularization hyperparameter.
           # It controls the trade-off between a smooth decision boundary and
           # classifying the training points correctly.
           # Increasing C can lead to overfitting the training data.
    kernel="poly", # kernel type
    degree=3, # degree of the polynomial kernel
    coef0=1, # independent term of the kernel function
    gamma="scale", # gamma coefficient of the kernel function
                   # scale = 1 / (n_features * X.var())
                   # auto = 1 / n_features
                   # a float > 0 can also be used
    class_weight="balanced", # give both classes the same effective weight
    max_iter=100000, # maximum number of optimizer iterations
    random_state=42
    )

# machine learning pipeline
pipe2 = Pipeline([
    ('column-transformer', ct),
    ('scaler', scaler),
    ('model', svc)
])

# cross-validate the solution
cv_list_pipe2_baseline = cross_val_score(
    pipe2,
    X_train,
    y_train,
    cv=10,
    scoring="balanced_accuracy"
)

mean_cv_pipe2_baseline = np.mean(cv_list_pipe2_baseline)
std_cv_pipe2_baseline = np.std(cv_list_pipe2_baseline)

print(f"Performance (bac): {round(mean_cv_pipe2_baseline, 4)} +- {round(std_cv_pipe2_baseline, 4)}")
Performance (bac): 0.8692 +- 0.0162

RBF Kernel#

The RBF kernel is defined in sklearn as:

\(exp(-\gamma \left \| x - x' \right\|^2)\)

where \(\gamma\) is set by the gamma hyperparameter and must be \(\geq 0\).
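
A minimal check, with toy points of our own, that this formula matches sklearn's implementation:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

u, v = np.array([[1.0, 2.0]]), np.array([[3.0, 4.0]])
gamma = 0.5
manual = np.exp(-gamma * np.sum((u - v) ** 2))
print(np.isclose(manual, rbf_kernel(u, v, gamma=gamma)[0, 0]))  # True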

# Transform categorical features via one-hot encoding
ct = ColumnTransformer( 
    transformers=[
        ("ohe", OneHotEncoder(), categorical_features),
    ],
    remainder='passthrough'
)

# min-max scaling of the data
scaler = MinMaxScaler()

# SVM model
svc = SVC(
    C=1.0, # C is the regularization hyperparameter.
           # It controls the trade-off between a smooth decision boundary and
           # classifying the training points correctly.
           # Increasing C can lead to overfitting the training data.
    kernel="rbf", # kernel type
    gamma="scale", # gamma coefficient of the kernel function
                   # scale = 1 / (n_features * X.var())
                   # auto = 1 / n_features
                   # a float > 0 can also be used
    class_weight="balanced", # give both classes the same effective weight
    max_iter=100000, # maximum number of optimizer iterations
    random_state=42
    )

# machine learning pipeline
pipe3 = Pipeline([
    ('column-transformer', ct),
    ('scaler', scaler),
    ('model', svc)
])

# cross-validate the solution
cv_list_pipe3_baseline = cross_val_score(
    pipe3,
    X_train,
    y_train,
    cv=10,
    scoring="balanced_accuracy"
)

mean_cv_pipe3_baseline = np.mean(cv_list_pipe3_baseline)
std_cv_pipe3_baseline = np.std(cv_list_pipe3_baseline)

print(f"Performance (bac): {round(mean_cv_pipe3_baseline, 4)} +- {round(std_cv_pipe3_baseline, 4)}")
Performance (bac): 0.857 +- 0.0176

Experimental Evaluation#

# cross-validation results

df_result_cv = pd.DataFrame(
    [cv_list_lr_baseline, cv_list_pipe1_baseline, cv_list_pipe2_baseline, cv_list_pipe3_baseline],
    index=["baseline","SVM-linear", "SVM-poly", "SMV-rbf"]
).T

df_result_cv
baseline SVM-linear SVM-poly SVM-rbf
0 0.761374 0.846160 0.842349 0.824149
1 0.765576 0.847759 0.867743 0.861861
2 0.768097 0.881166 0.889017 0.884159
3 0.777053 0.856347 0.881925 0.866431
4 0.777053 0.854666 0.861020 0.839275
5 0.782279 0.855587 0.862045 0.859052
6 0.836776 0.867168 0.891722 0.862413
7 0.777893 0.860077 0.864934 0.869505
8 0.815479 0.839403 0.847289 0.834308
9 0.772090 0.841585 0.884272 0.868476
# flatten the matrix
df_res = df_result_cv.stack().to_frame("balanced_accuracy")
df_res.index.rename(["fold", "pipelines"], inplace=True)
df_res = df_res.reset_index()
df_res.head(12)
fold pipelines balanced_accuracy
0 0 baseline 0.761374
1 0 SVM-linear 0.846160
2 0 SVM-poly 0.842349
3 0 SVM-rbf 0.824149
4 1 baseline 0.765576
5 1 SVM-linear 0.847759
6 1 SVM-poly 0.867743
7 1 SVM-rbf 0.861861
8 2 baseline 0.768097
9 2 SVM-linear 0.881166
10 2 SVM-poly 0.889017
11 2 SVM-rbf 0.884159
plt.figure(figsize=(10,10))

ax = sns.boxplot(x="pipelines", y="balanced_accuracy", data=df_res)
ax = sns.swarmplot(x="pipelines", y="balanced_accuracy", data=df_res, color=".40")
../_images/96113cd02d1277299bf1a70683e174655675577c04e9093a9069efecb6e0b180.png
# retrain the selected pipeline on all the training data

pipe2.fit(X_train, y_train)
y_pred = pipe2.predict(X_test)
bac = balanced_accuracy_score(y_test, y_pred) 

print("Performance: ", round(bac, 4))
Performance:  0.8633

Referências & Links#

  1. Support Vector Machines

  2. Credit Card Customer Data

  3. The Kernel Trick in Support Vector Classification

  4. Understanding Support Vector Machine(SVM) algorithm from examples (along with code)


Practical 5b: Predicting credit product churn with the Multilayer Perceptron algorithm#


Descrição & Objetivo#

Data description: A company has a well-known credit card product. However, given strong competition, the company noticed that its customers were starting to abandon the product. The company hired you to model which customers are likely to leave, so it can intervene proactively and avoid losing them. The available data contains several customer characteristics (e.g., sex, age, marital status, education) and credit product usage features (e.g., card limit, card type). The Attrition_Flag column indicates whether the customer abandoned the product.

Objective: Model the problem with an MLP so that it returns an ordered list of customers, from the most to the least likely to abandon the product (see the ranking sketch at the end of this practical).

1. Reading the Data#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import plot_tree, DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import balanced_accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
# Read the data
data = pd.read_csv("credit_card_churn_data.csv")

target = 'Attrition_Flag'

categorical_features = ['Gender',  'Education_Level',
                        'Marital_Status', 'Income_Category', 'Card_Category'
                        ]

numerical_features = ['Dependent_count', 'Customer_Age',  'Months_on_book',
                      'Total_Relationship_Count', 'Months_Inactive_12_mon',
                      'Contacts_Count_12_mon', 'Credit_Limit',
                      'Total_Revolving_Bal', 'Avg_Open_To_Buy',
                      'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
                      'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1',
                      'Avg_Utilization_Ratio'
                      ]
data
Dependent_count Customer_Age Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio Gender Education_Level Marital_Status Income_Category Card_Category Attrition_Flag
0 3 45 39 5 1 3 12691.0 777 11914.0 1.335 1144 42 1.625 0.061 M High School Married $60K - $80K Blue Existing Customer
1 5 49 44 6 1 2 8256.0 864 7392.0 1.541 1291 33 3.714 0.105 F Graduate Single Less than $40K Blue Existing Customer
2 3 51 36 4 1 0 3418.0 0 3418.0 2.594 1887 20 2.333 0.000 M Graduate Married $80K - $120K Blue Existing Customer
3 4 40 34 3 4 1 3313.0 2517 796.0 1.405 1171 20 2.333 0.760 F High School Unknown Less than $40K Blue Existing Customer
4 3 40 21 5 1 0 4716.0 0 4716.0 2.175 816 28 2.500 0.000 M Uneducated Married $60K - $80K Blue Existing Customer
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10122 2 50 40 3 2 3 4003.0 1851 2152.0 0.703 15476 117 0.857 0.462 M Graduate Single $40K - $60K Blue Existing Customer
10123 2 41 25 4 2 3 4277.0 2186 2091.0 0.804 8764 69 0.683 0.511 M Unknown Divorced $40K - $60K Blue Attrited Customer
10124 1 44 36 5 3 4 5409.0 0 5409.0 0.819 10291 60 0.818 0.000 F High School Married Less than $40K Blue Attrited Customer
10125 2 30 36 4 3 3 5281.0 0 5281.0 0.535 8395 62 0.722 0.000 M Graduate Unknown $40K - $60K Blue Attrited Customer
10126 2 43 25 6 2 4 10388.0 1961 8427.0 0.703 10294 61 0.649 0.189 F Graduate Married Less than $40K Silver Attrited Customer

10127 rows × 20 columns

2. Exploratory Data Analysis#

# Number of examples and features
data.shape
(10127, 20)
# Count the classes
data[target].value_counts()
Existing Customer    8500
Attrited Customer    1627
Name: Attrition_Flag, dtype: int64
# Fraction of examples per class
data[target].value_counts(normalize=True)
Existing Customer    0.83934
Attrited Customer    0.16066
Name: Attrition_Flag, dtype: float64
# Check for NaN values
data.isna().sum()
Dependent_count             0
Customer_Age                0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
Gender                      0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Attrition_Flag              0
dtype: int64
# Describe the categorical data
data.describe(include=[object])
Gender Education_Level Marital_Status Income_Category Card_Category Attrition_Flag
count 10127 10127 10127 10127 10127 10127
unique 2 7 4 6 4 2
top F Graduate Married Less than $40K Blue Existing Customer
freq 5358 3128 4687 3561 9436 8500
# Describe the numerical data
data.describe(include=[np.number])
Dependent_count Customer_Age Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
count 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000 10127.000000
mean 2.346203 46.325960 35.928409 3.812580 2.341167 2.455317 8631.953698 1162.814061 7469.139637 0.759941 4404.086304 64.858695 0.712222 0.274894
std 1.298908 8.016814 7.986416 1.554408 1.010622 1.106225 9088.776650 814.987335 9090.685324 0.219207 3397.129254 23.472570 0.238086 0.275691
min 0.000000 26.000000 13.000000 1.000000 0.000000 0.000000 1438.300000 0.000000 3.000000 0.000000 510.000000 10.000000 0.000000 0.000000
25% 1.000000 41.000000 31.000000 3.000000 2.000000 2.000000 2555.000000 359.000000 1324.500000 0.631000 2155.500000 45.000000 0.582000 0.023000
50% 2.000000 46.000000 36.000000 4.000000 2.000000 2.000000 4549.000000 1276.000000 3474.000000 0.736000 3899.000000 67.000000 0.702000 0.176000
75% 3.000000 52.000000 40.000000 5.000000 3.000000 3.000000 11067.500000 1784.000000 9859.000000 0.859000 4741.000000 81.000000 0.818000 0.503000
max 5.000000 73.000000 56.000000 6.000000 6.000000 6.000000 34516.000000 2517.000000 34516.000000 3.397000 18484.000000 139.000000 3.714000 0.999000

3. Modelagem & Avaliação#

3.1 Baseline#

# Split the data into training and test sets

X = data.drop(columns=[target])
y = data[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
# Transform categorical features via one-hot encoding
ct = ColumnTransformer( 
    transformers=[
        ("ohe", OneHotEncoder(), categorical_features),
    ],
    remainder='passthrough'
)

# z-score scaling of the data
scaler = StandardScaler()

# baseline model
lrc = LogisticRegression(random_state=42)

# machine learning pipeline
pipeb = Pipeline([
    ('column-transformer', ct),
    ('scaler', scaler),
    ('model', lrc)
])

# cross-validate the solution
cv_list_lr_baseline = cross_val_score(
    pipeb,
    X_train,
    y_train,
    cv=10,
    scoring="balanced_accuracy"
)
mean_cv_lr_baseline = np.mean(cv_list_lr_baseline)
std_cv_lr_baseline = np.std(cv_list_lr_baseline)

print(f"Performance (bac): {round(mean_cv_lr_baseline, 4)} +- {round(std_cv_lr_baseline, 4)}")
Performance (bac): 0.7834 +- 0.0227

3.2 Modeling with MLP#


1 hidden layer –> (10,)#

# Transform categorical features via one-hot encoding
ct = ColumnTransformer( 
    transformers=[
        ("ohe", OneHotEncoder(), categorical_features),
    ],
    remainder='passthrough'
)

# min-max scaling of the data
scaler = MinMaxScaler()

# MLP model
mlp = MLPClassifier(
    hidden_layer_sizes=(10,), # number of neurons per hidden layer
    activation = "relu", # activation function for the hidden layers
                         # options: identity, logistic, tanh and relu
    alpha=0.0001, # strength of the L2 regularization term
    solver="adam", # optimizer
    batch_size="auto", # mini-batch size for the stochastic optimizers
    learning_rate_init= 0.01, # initial learning rate; it controls the step size of the weight updates
    learning_rate="constant", # options: constant, invscaling, adaptive
    early_stopping=False, # whether to stop training early when the
                          # validation performance stops improving
    max_iter=200, # maximum number of optimizer iterations
    random_state=0
    )

# machine learning pipeline
pipe1 = Pipeline([
    ('column-transformer', ct),
    ('scaler', scaler),
    ('model', mlp)
])

# cross-validate the solution
cv_list_pipe1_baseline = cross_val_score(
    pipe1,
    X_train,
    y_train,
    cv=10,
    scoring="balanced_accuracy"
)

mean_cv_pipe1_baseline = np.mean(cv_list_pipe1_baseline)
std_cv_pipe1_baseline = np.std(cv_list_pipe1_baseline)

print(f"Performance (bac): {round(mean_cv_pipe1_baseline, 4)} +- {round(std_cv_pipe1_baseline, 4)}")
Performance (bac): 0.8612 +- 0.0277

2 hidden layers –> (10, 10)#

# Transform categorical features via one-hot encoding
ct = ColumnTransformer( 
    transformers=[
        ("ohe", OneHotEncoder(), categorical_features),
    ],
    remainder='passthrough'
)

# min-max scaling of the data
scaler = MinMaxScaler()

# MLP model
mlp = MLPClassifier(
    hidden_layer_sizes=(10, 10), # number of neurons per hidden layer
    activation = "relu", # activation function for the hidden layers
                         # options: identity, logistic, tanh and relu
    alpha=0.0001, # strength of the L2 regularization term
    solver="adam", # optimizer
    batch_size="auto", # mini-batch size for the stochastic optimizers
    learning_rate_init= 0.01, # initial learning rate; it controls the step size of the weight updates
    learning_rate="constant", # options: constant, invscaling, adaptive
    early_stopping=False, # whether to stop training early when the
                          # validation performance stops improving
    max_iter=200, # maximum number of optimizer iterations
    random_state=0
    )

# machine learning pipeline
pipe2 = Pipeline([
    ('column-transformer', ct),
    ('scaler', scaler),
    ('model', mlp)
])

# cross-validate the solution
cv_list_pipe2_baseline = cross_val_score(
    pipe2,
    X_train,
    y_train,
    cv=10,
    scoring="balanced_accuracy"
)

mean_cv_pipe2_baseline = np.mean(cv_list_pipe2_baseline)
std_cv_pipe2_baseline = np.std(cv_list_pipe2_baseline)

print(f"Performance (bac): {round(mean_cv_pipe2_baseline, 4)} +- {round(std_cv_pipe2_baseline, 4)}")
/usr/local/lib/python3.7/dist-packages/sklearn/neural_network/_multilayer_perceptron.py:696: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  ConvergenceWarning,
Performance (bac): 0.8679 +- 0.0367

3 hidden layers –> (10, 10, 10)#

# Transform categorical features via one-hot encoding
ct = ColumnTransformer( 
    transformers=[
        ("ohe", OneHotEncoder(), categorical_features),
    ],
    remainder='passthrough'
)

# min-max scaling of the data
scaler = MinMaxScaler()

# MLP model
mlp = MLPClassifier(
    hidden_layer_sizes=(10, 10, 10), # number of neurons per hidden layer
    activation = "relu", # activation function for the hidden layers
                         # options: identity, logistic, tanh and relu
    alpha=0.0001, # strength of the L2 regularization term
    solver="adam", # optimizer
    batch_size="auto", # mini-batch size for the stochastic optimizers
    learning_rate_init= 0.01, # initial learning rate; it controls the step size of the weight updates
    learning_rate="constant", # options: constant, invscaling, adaptive
    early_stopping=False, # whether to stop training early when the
                          # validation performance stops improving
    max_iter=200, # maximum number of optimizer iterations
    random_state=0
    )

# machine learning pipeline
pipe3 = Pipeline([
    ('column-transformer', ct),
    ('scaler', scaler),
    ('model', mlp)
])

# cross-validate the solution
cv_list_pipe3_baseline = cross_val_score(
    pipe3,
    X_train,
    y_train,
    cv=10,
    scoring="balanced_accuracy"
)

mean_cv_pipe3_baseline = np.mean(cv_list_pipe3_baseline)
std_cv_pipe3_baseline = np.std(cv_list_pipe3_baseline)

print(f"Performance (bac): {round(mean_cv_pipe3_baseline, 4)} +- {round(std_cv_pipe3_baseline, 4)}")
Performance (bac): 0.9082 +- 0.0129

4 hidden layers –> (10, 10, 10, 10)#

# Transform categorical features via one-hot encoding
ct = ColumnTransformer( 
    transformers=[
        ("ohe", OneHotEncoder(), categorical_features),
    ],
    remainder='passthrough'
)

# min-max scaling of the data
scaler = MinMaxScaler()

# MLP model
mlp = MLPClassifier(
    hidden_layer_sizes=(10, 10, 10, 10), # number of neurons per hidden layer
    activation = "relu", # activation function for the hidden layers
                         # options: identity, logistic, tanh and relu
    alpha=0.0001, # strength of the L2 regularization term
    solver="adam", # optimizer
    batch_size="auto", # mini-batch size for the stochastic optimizers
    learning_rate_init= 0.01, # initial learning rate; it controls the step size of the weight updates
    learning_rate="constant", # options: constant, invscaling, adaptive
    early_stopping=False, # whether to stop training early when the
                          # validation performance stops improving
    max_iter=200, # maximum number of optimizer iterations
    random_state=0
    )

# machine learning pipeline
pipe4 = Pipeline([
    ('column-transformer', ct),
    ('scaler', scaler),
    ('model', mlp)
])

# cross-validate the solution
cv_list_pipe4_baseline = cross_val_score(
    pipe4,
    X_train,
    y_train,
    cv=10,
    scoring="balanced_accuracy"
)

mean_cv_pipe4_baseline = np.mean(cv_list_pipe4_baseline)
std_cv_pipe4_baseline = np.std(cv_list_pipe4_baseline)

print(f"Performance (bac): {round(mean_cv_pipe4_baseline, 4)} +- {round(std_cv_pipe4_baseline, 4)}")
Performance (bac): 0.8684 +- 0.0163

Experimental Evaluation#

# cross-validation results

df_result_cv = pd.DataFrame(
    [cv_list_lr_baseline, cv_list_pipe1_baseline, cv_list_pipe2_baseline, cv_list_pipe3_baseline, cv_list_pipe4_baseline],
    index=["baseline","MLP(10,)", "MLP(10, 10)", "MLP(10, 10, 10)", "MLP(10, 10, 10, 10)"]
).T

df_result_cv
baseline MLP(10,) MLP(10, 10) MLP(10, 10, 10) MLP(10, 10, 10, 10)
0 0.761374 0.822778 0.853008 0.899676 0.872232
1 0.765576 0.830341 0.863092 0.911337 0.833886
2 0.768097 0.863092 0.853111 0.880267 0.866453
3 0.777053 0.908448 0.892769 0.905558 0.862067
4 0.777053 0.872520 0.879979 0.923286 0.896970
5 0.782279 0.834911 0.816342 0.901356 0.883341
6 0.836776 0.887542 0.889776 0.923470 0.868974
7 0.777893 0.849093 0.800295 0.903590 0.857865
8 0.815479 0.895800 0.919059 0.908426 0.882111
9 0.772090 0.847929 0.912003 0.924831 0.860140
# flatten the matrix
df_res = df_result_cv.stack().to_frame("balanced_accuracy")
df_res.index.rename(["fold", "pipelines"], inplace=True)
df_res = df_res.reset_index()
df_res.head(12)
fold pipelines balanced_accuracy
0 0 baseline 0.761374
1 0 MLP(10,) 0.822778
2 0 MLP(10, 10) 0.853008
3 0 MLP(10, 10, 10) 0.899676
4 0 MLP(10, 10, 10, 10) 0.872232
5 1 baseline 0.765576
6 1 MLP(10,) 0.830341
7 1 MLP(10, 10) 0.863092
8 1 MLP(10, 10, 10) 0.911337
9 1 MLP(10, 10, 10, 10) 0.833886
10 2 baseline 0.768097
11 2 MLP(10,) 0.863092
plt.figure(figsize=(10,10))

ax = sns.boxplot(x="pipelines", y="balanced_accuracy", data=df_res)
ax = sns.swarmplot(x="pipelines", y="balanced_accuracy", data=df_res, color=".40")
../_images/1172e985c81c2f5e26362086f302086571dffcdc919b75fb3cba2414a3411b86.png

ROC curve#

from sklearn.metrics import DetCurveDisplay, RocCurveDisplay

pipelines = {
    "baseline": pipeb,
    "MLP(10,)": pipe1,
    "MLP(10, 10)": pipe2,
    "MLP(10, 10, 10)": pipe3,
    "MLP(10, 10, 10, 10)": pipe4,
}

fig, [ax_roc, ax_det] = plt.subplots(1, 2, figsize=(11, 5))

for name, clf in pipelines.items():
    clf.fit(X_train, y_train)

    RocCurveDisplay.from_estimator(clf, X_test, y_test, ax=ax_roc, name=name)
    DetCurveDisplay.from_estimator(clf, X_test, y_test, ax=ax_det, name=name)

ax_roc.set_title("Receiver Operating Characteristic (ROC) curves")
ax_det.set_title("Detection Error Tradeoff (DET) curves")

ax_roc.grid(linestyle="--")
ax_det.grid(linestyle="--")

plt.legend()
plt.show()
../_images/8ea0a711ec3bc94abccb07646816d3dce58289c78223b2b476d147aaf998c9d3.png
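
Each ROC curve can also be summarized as a single number, the area under the curve. A minimal sketch using the pipelines already fitted in the loop above; the probability column for the churn class is looked up from classes_ instead of being hard-coded:

from sklearn.metrics import roc_auc_score

y_true = (y_test == "Attrited Customer").astype(int)
for name, clf in pipelines.items():
    col = list(clf.classes_).index("Attrited Customer")  # probability column of the churn class
    auc = roc_auc_score(y_true, clf.predict_proba(X_test)[:, col])
    print(f"{name}: AUC = {auc:.4f}")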

Final pipeline#

# retrain the selected pipeline on all the training data

pipe3.fit(X_train, y_train)
y_pred = pipe3.predict(X_test)
bac = balanced_accuracy_score(y_test, y_pred) 

print("Performance: ", round(bac, 4))
Performance:  0.8441
# Instead of the hard predictions we can use the probabilities.
# This lets us order customers from the most to the least likely to abandon the product.

y_pred_prob = pipe3.predict_proba(X_test)
y_pred_prob[:, 0]
array([4.96625438e-05, 1.80525853e-03, 9.37936844e-01, ...,
       1.35154815e-02, 7.06941771e-07, 6.21152944e-04])
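
To meet the stated objective, a minimal sketch that turns these probabilities into the ordered list of customers, from the most to the least likely to abandon the product (the churn column index is looked up from classes_ rather than assumed to be 0):

# Sketch: rank the test customers by churn probability
churn_col = list(pipe3.classes_).index("Attrited Customer")
ranking = X_test.copy()
ranking["churn_probability"] = pipe3.predict_proba(X_test)[:, churn_col]
ranking.sort_values("churn_probability", ascending=False).head(10)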

Referências & Links#

  1. Multilayer Perceptron

  2. Credit Card Customer Data

  3. Multilayer Perceptron Explained with a Real-Life Example and Python Code: Sentiment Analysis

Practical 6: Using Ensemble Algorithms on a Regression Problem#

Descrição & Objetivo#

Data description: Avocado-obsessed consumers decided to build a spreadsheet with the average annual price across several regions from 2015 to 2018. The following fields were recorded:

  • AveragePrice - average price of an avocado

  • type - avocado type: conventional or organic

  • year - year

  • Region - city or region where the observation was made

  • Total Volume - average number of avocados sold daily

  • 4046 - average daily number of avocados sold with PLU 4046

  • 4225 - average daily number of avocados sold with PLU 4225

  • 4770 - average daily number of avocados sold with PLU 4770

To learn more about PLU codes, see this site.

Objective: Model the problem with ensembles to predict the average price of an avocado.

1. Imports#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.tree import plot_tree, DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import balanced_accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

2. Reading the Data#

# Read the data
data = pd.read_csv("avocado_grouped_data.csv")

target = 'AveragePrice'

categorical_features = ["region", "type"]

numerical_features = ["year", "Total Volume", "4046", "4225", "4770"]

data
year type region AveragePrice Total Volume 4046 4225 4770
0 2015 conventional Albany 1.171923 7.620873e+04 1037.874615 61764.253654 668.795000
1 2015 conventional Atlanta 1.052308 4.403464e+05 347741.840385 35386.637308 757.858077
2 2015 conventional BaltimoreWashington 1.168077 7.681415e+05 56546.030769 487421.365385 45104.819423
3 2015 conventional Boise 1.054038 7.088575e+04 45940.442500 10164.187115 5309.087692
4 2015 conventional Boston 1.144038 5.237806e+05 4685.945192 409901.282692 1607.198846
... ... ... ... ... ... ... ... ...
427 2018 organic Syracuse 1.242500 5.688144e+03 143.621667 101.041667 0.000000
428 2018 organic Tampa 1.452500 8.415949e+03 125.270833 610.078333 0.000000
429 2018 organic TotalUS 1.554167 1.510488e+06 137640.345000 360420.295000 1407.105833
430 2018 organic West 1.613333 2.549791e+05 28676.619167 53221.085833 191.065833
431 2018 organic WestTexNewMexico 1.645000 1.676254e+04 1918.260833 2275.085000 140.700833

432 rows × 8 columns

3. Exploratory Data Analysis#

# Number of examples and features
data.shape
(432, 8)
# target distribution
data[target].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d2336d4d0>
../_images/42881c3a700f73dd798bd773cce1d1aff6e3524728556f741c317222b17cd0d7.png
# target statistics
data[target].describe()
count    432.000000
mean       1.394253
std        0.320811
min        0.659808
25%        1.145817
50%        1.385064
75%        1.630096
max        2.403208
Name: AveragePrice, dtype: float64
# Check for NaN values
data.isna().sum()
year            0
type            0
region          0
AveragePrice    0
Total Volume    0
4046            0
4225            0
4770            0
dtype: int64
# Describe the categorical data
data.describe(include=[object])
type region
count 432 432
unique 2 54
top conventional Albany
freq 216 8
# Describe the numerical data
data.describe(include=[np.number])
year AveragePrice Total Volume 4046 4225 4770
count 432.00000 432.000000 4.320000e+02 4.320000e+02 4.320000e+02 4.320000e+02
mean 2016.50000 1.394253 8.920706e+05 3.049745e+05 2.989822e+05 2.188050e+04
std 1.11933 0.320811 3.582444e+06 1.295333e+06 1.194733e+06 9.576609e+04
min 2015.00000 0.659808 1.289274e+03 2.679057e+00 6.390962e+00 0.000000e+00
25% 2015.75000 1.145817 1.230606e+04 8.947879e+02 3.480897e+03 9.158173e-01
50% 2016.50000 1.385064 1.180031e+05 1.063797e+04 3.042860e+04 3.231658e+02
75% 2017.25000 1.630096 4.774981e+05 1.167135e+05 1.612054e+05 7.488512e+03
max 2018.00000 2.403208 4.212553e+07 1.467107e+07 1.243870e+07 1.153793e+06

3. Modelagem & Avaliação com Ensembles#

4.1 Baseline#

# Split the data into training and test sets

X = data.drop(columns=[target])
y = data[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Encode categorical features as integers with the OrdinalEncoder
ct = ColumnTransformer( 
    transformers=[
        ("cat", OrdinalEncoder(), categorical_features),
    ],
    remainder='passthrough'
)

# z-score scaling of the data
scaler = StandardScaler()

# baseline model
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

# machine learning pipeline
pipeb = Pipeline([
    ('column-transformer', ct),
    ('scaler', scaler),
    ('model', lr)
])

# cross-validate the solution
cv_list_lr_baseline = cross_val_score(
    pipeb,
    X_train,
    y_train,
    cv=10,
    scoring="r2"
)
mean_cv_lr_baseline = np.mean(cv_list_lr_baseline)
std_cv_lr_baseline = np.std(cv_list_lr_baseline)

print(f"Performance (r2): {round(mean_cv_lr_baseline, 4)} +- {round(std_cv_lr_baseline, 4)}")
Performance (r2): 0.5537 +- 0.1066
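To see exactly what the preprocessing feeds the linear model, a minimal sketch (using the ct defined above) fits the ColumnTransformer alone and inspects the first rows:

# Minimal sketch: encoded categorical columns come first,
# then the passthrough numeric columns kept by remainder='passthrough'
ct.fit(X_train)
ct.transform(X_train)[:3]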


4.2 Modeling with Random Forest and Gradient Boosting#

# one-hot encode the categorical features
ct = ColumnTransformer(
    transformers=[
        ("ohe", OneHotEncoder(), categorical_features),
    ],
    remainder='passthrough'
)

# RandomForestRegressor model
rf = RandomForestRegressor(
    n_estimators=100,
    criterion="squared_error",
    random_state=0
)

# machine learning pipeline
pipe1 = Pipeline([
    ('column-transformer', ct),
    ('model', rf)
])

# cross-validate the solution
cv_list_pipe1_baseline = cross_val_score(
    pipe1,
    X_train,
    y_train,
    cv=10,
    scoring="r2"
)

mean_cv_pipe1_baseline = np.mean(cv_list_pipe1_baseline)
std_cv_pipe1_baseline = np.std(cv_list_pipe1_baseline)

print(f"Performance (R2): {round(mean_cv_pipe1_baseline, 4)} +- {round(std_cv_pipe1_baseline, 4)}")
Performance (R2): 0.7605 +- 0.0821
# one-hot encode the categorical features
ct = ColumnTransformer(
    transformers=[
        ("ohe", OneHotEncoder(), categorical_features),
    ],
    remainder='passthrough'
)

# GradientBoostingRegressor model
gb = GradientBoostingRegressor(
    n_estimators=100,
    loss="squared_error",
    learning_rate=0.1,
    random_state=0
)

# machine learning pipeline
pipe2 = Pipeline([
    ('column-transformer', ct),
    ('model', gb)
])

# cross-validate the solution
cv_list_pipe2_baseline = cross_val_score(
    pipe2,
    X_train,
    y_train,
    cv=10,
    scoring="r2"
)

mean_cv_pipe2_baseline = np.mean(cv_list_pipe2_baseline)
std_cv_pipe2_baseline = np.std(cv_list_pipe2_baseline)

print(f"Performance (R2): {round(mean_cv_pipe2_baseline, 4)} +- {round(std_cv_pipe2_baseline, 4)}")
Performance (R2): 0.7936 +- 0.0495
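Both ensembles were run with default hyperparameters. As a hedged sketch, a grid search over pipe2 illustrates the step-name__parameter convention for tuning inside a Pipeline; the grid values below are illustrative assumptions, not tuned choices:

from sklearn.model_selection import GridSearchCV

# Illustrative grid (the values are assumptions, not tuned choices)
param_grid = {
    "model__n_estimators": [100, 300],
    "model__learning_rate": [0.05, 0.1],
    "model__max_depth": [2, 3],
}

grid = GridSearchCV(pipe2, param_grid, cv=10, scoring="r2")
grid.fit(X_train, y_train)
print("Best params: ", grid.best_params_)
print("Best CV R2:  ", round(grid.best_score_, 4))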

Experimental Evaluation#

# cross-validation results

df_result_cv = pd.DataFrame(
    [cv_list_lr_baseline, cv_list_pipe1_baseline, cv_list_pipe2_baseline],
    index=["baseline","RandomForestRegressor", "GradientBoostingRegressor"]
).T

df_result_cv
   baseline  RandomForestRegressor  GradientBoostingRegressor
0  0.538925               0.770106                   0.792857
1  0.463798               0.557949                   0.676694
2  0.630720               0.783451                   0.820644
3  0.639834               0.730605                   0.807127
4  0.402981               0.711545                   0.817230
5  0.444394               0.778620                   0.759449
6  0.543634               0.806082                   0.770342
7  0.456912               0.746108                   0.793083
8  0.720543               0.859275                   0.822751
9  0.695542               0.860805                   0.876062
# reshape the fold × pipeline matrix into long format
# (the scores are R2 values, not balanced accuracy)
df_res = df_result_cv.stack().to_frame("r2")
df_res.index.rename(["fold", "pipelines"], inplace=True)
df_res = df_res.reset_index()
df_res.head(12)
    fold                  pipelines        r2
0      0                   baseline  0.538925
1      0      RandomForestRegressor  0.770106
2      0  GradientBoostingRegressor  0.792857
3      1                   baseline  0.463798
4      1      RandomForestRegressor  0.557949
5      1  GradientBoostingRegressor  0.676694
6      2                   baseline  0.630720
7      2      RandomForestRegressor  0.783451
8      2  GradientBoostingRegressor  0.820644
9      3                   baseline  0.639834
10     3      RandomForestRegressor  0.730605
11     3  GradientBoostingRegressor  0.807127
plt.figure(figsize=(10, 10))

ax = sns.boxplot(x="pipelines", y="r2", data=df_res)
ax = sns.swarmplot(x="pipelines", y="r2", data=df_res, color=".40")
[Figure: box plot with swarm overlay of the per-fold R2 scores for each pipeline]
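The same comparison can be summarized numerically (a minimal sketch over the df_result_cv built above):

# mean and standard deviation of the per-fold R2 scores per pipeline
df_result_cv.agg(["mean", "std"])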

Final Pipeline#

# retrain the selected pipeline on all the training data
from sklearn.metrics import r2_score

pipe2.fit(X_train, y_train)
y_pred = pipe2.predict(X_test)
r2_test = r2_score(y_test, y_pred)

print("R2: ", round(r2_test, 4))
R2:  0.8158
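R2 alone does not show the scale of the errors; a minimal sketch with two complementary metrics on the same held-out predictions:

from sklearn.metrics import mean_absolute_error, mean_squared_error

# absolute error metrics on the held-out test set
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("MAE: ", round(mae, 4))
print("RMSE:", round(rmse, 4))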

Error Visualization#

from yellowbrick.regressor import prediction_error

visualizer = prediction_error(pipe2, X_train, y_train, X_test, y_test)
[Figure: yellowbrick prediction error plot for pipe2]
from yellowbrick.regressor import residuals_plot

viz = residuals_plot(pipe2, X_train, y_train, X_test, y_test)
[Figure: yellowbrick residuals plot for pipe2]
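Finally, a minimal sketch (assuming scikit-learn >= 1.0, which provides ColumnTransformer.get_feature_names_out) of the impurity-based feature importances of the fitted GradientBoostingRegressor:

# map importances back to the transformed feature names
# (one-hot columns first, then the passthrough numeric columns)
feature_names = pipe2.named_steps["column-transformer"].get_feature_names_out()
importances = pipe2.named_steps["model"].feature_importances_

pd.Series(importances, index=feature_names).sort_values(ascending=False).head(10)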

References & Links#

  1. Ensemble methods: https://scikit-learn.org/stable/modules/ensemble.html

  2. GradientBoostingRegressor: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html

  3. RandomForestRegressor: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

  4. Avocado Prices: https://www.kaggle.com/datasets/neuromusic/avocado-prices